Income Prediction Project: US Census Data Analysis¶
1. Understanding the Problem and Data¶
This project focuses on building a machine learning model to predict whether an individual earns more or less than $50,000 per year based on census data. We're working with a binary classification problem using data from the US Census that contains information for about 300,000 individuals.
Let's start by importing the necessary libraries and loading our data.
# !pip list
Packages Used in the Virtual Environment¶
# Package Version
# ----------------------- -----------
# asttokens 3.0.0
# cloudpickle 3.1.1
# colorama 0.4.6
# comm 0.2.2
# contourpy 1.3.1
# cycler 0.12.1
# debugpy 1.8.13
# decorator 5.2.1
# executing 2.2.0
# fonttools 4.56.0
# ipykernel 6.29.5
# ipython 9.0.2
# ipython_pygments_lexers 1.1.1
# jedi 0.19.2
# joblib 1.4.2
# jupyter_client 8.6.3
# jupyter_core 5.7.2
# kiwisolver 1.4.8
# llvmlite 0.44.0
# matplotlib 3.10.1
# matplotlib-inline 0.1.7
# nest-asyncio 1.6.0
# numba 0.61.0
# numpy 2.1.3
# packaging 24.2
# pandas 2.2.3
# parso 0.8.4
# pillow 11.1.0
# pip 24.2
# platformdirs 4.3.6
# prompt_toolkit 3.0.50
# psutil 7.0.0
# pure_eval 0.2.3
# Pygments 2.19.1
# pyparsing 3.2.1
# python-dateutil 2.9.0.post0
# pytz 2025.1
# pywin32 309
# pyzmq 26.3.0
# scikit-learn 1.6.1
# scipy 1.15.2
# seaborn 0.13.2
# shap 0.47.0
# six 1.17.0
# slicer 0.0.8
# stack-data 0.6.3
# threadpoolctl 3.5.0
# tornado 6.4.2
# tqdm 4.67.1
# traitlets 5.14.3
# typing_extensions 4.12.2
# tzdata 2025.1
# wcwidth 0.2.13
# xgboost 2.1.4
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, auc, log_loss, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
import xgboost as xgb
import time
import pickle
import os
import warnings
warnings.filterwarnings('ignore')
# Set plot style
plt.style.use('ggplot')
# Increase the display size of outputs and DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
Let's load the training and test data:
# Load data from CSV files
# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
train_data = pd.read_csv(r'C:\Important Files\Code and Software\Python Projects\DataIku\Data\census_income_learn.csv', header=None)
test_data = pd.read_csv(r'C:\Important Files\Code and Software\Python Projects\DataIku\Data\census_income_test.csv', header=None)
print(test_data.shape, train_data.shape)
combined_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
print("Combined data shape:", combined_data.shape)
train_data.head()
(99762, 42) (199523, 42)
Combined data shape: (299285, 42)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | Not in universe | 0 | 0 | High school graduate | 0 | Not in universe | Widowed | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Not in labor force | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Other Rel 18+ ever marr not in subfamily | Other relative of householder | 1700.09 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 1 | 58 | Self-employed-not incorporated | 4 | 34 | Some college but no degree | 0 | Not in universe | Divorced | Construction | Precision production craft & repair | White | All other | Male | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Head of household | South | Arkansas | Householder | Householder | 1053.55 | MSA to MSA | Same county | Same county | No | Yes | 1 | Not in universe | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 94 | - 50000. |
| 2 | 18 | Not in universe | 0 | 0 | 10th grade | 0 | High school | Never married | Not in universe or children | Not in universe | Asian or Pacific Islander | All other | Female | Not in universe | Not in universe | Not in labor force | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child 18+ never marr Not in a subfamily | Child 18 or older | 991.95 | ? | ? | ? | Not in universe under 1 year old | ? | 0 | Not in universe | Vietnam | Vietnam | Vietnam | Foreign born- Not a citizen of U S | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 3 | 9 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1758.14 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
| 4 | 10 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | White | All other | Female | Not in universe | Not in universe | Children or Armed Forces | 0 | 0 | 0 | Nonfiler | Not in universe | Not in universe | Child <18 never marr not in subfamily | Child under 18 never married | 1069.16 | Nonmover | Nonmover | Nonmover | Yes | Not in universe | 0 | Both parents present | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
Based on the metadata file, let's assign meaningful column names to our data:
# Column names from metadata
column_names = [
'age', 'class_of_worker', 'industry_code', 'occupation_code', 'education',
'wage_per_hour', 'enrolled_in_edu_inst', 'marital_status', 'major_industry_code',
'major_occupation_code', 'race', 'hispanic_origin', 'sex', 'member_of_labor_union',
'reason_for_unemployment', 'full_or_part_time_employment', 'capital_gains',
'capital_losses', 'dividends_from_stocks', 'tax_filer_status', 'region_of_previous_residence',
'state_of_previous_residence', 'detailed_household_summary', 'detailed_household_summary_in_household',
'instance_weight', 'migration_code_change_in_msa', 'migration_code_change_in_reg',
'migration_code_move_within_reg', 'live_in_this_house_1_year_ago', 'migration_prev_res_in_sunbelt',
'num_persons_worked_for_employer', 'family_members_under_18', 'country_of_birth_father',
'country_of_birth_mother', 'country_of_birth_self', 'citizenship', 'own_business_or_self_employed',
'fill_inc_questionnaire_for_veteran', 'veterans_benefits', 'weeks_worked_in_year', 'year', 'income'
]
# Apply column names
train_data.columns = column_names
test_data.columns = column_names
combined_data.columns = column_names
# Check the income target distribution
combined_data['income'].value_counts()
income
- 50000.    280717
50000+.      18568
Name: count, dtype: int64
2. Data Check¶
Let's analyze the data to understand its structure, missing values, and data types.
# Check data types and missing data
def analyze_dataframe(df):
    # Replace blank values with NaN
    df = df.replace(r'^\s*$', np.nan, regex=True)
    # Get the data type for each column
    data_types = df.dtypes
    # Get the number of missing values for each column
    missing_values = df.isnull().sum()
    # Calculate the percentage of missing values for each column
    missing_percentage = (missing_values / len(df)) * 100
    # Get the number of unique values for each column
    unique_values = df.nunique()
    # Combine all the information into a single DataFrame
    analysis_df = pd.DataFrame({
        'Data Type': data_types,
        'Missing Values': missing_values,
        '% Missing': missing_percentage,
        'Unique Values': unique_values
    })
    return analysis_df

def count_data_types(df):
    # Get the data type for each column and count occurrences of each type
    data_type_counts = df.dtypes.value_counts()
    return data_type_counts
# Analyze the DataFrame
analysis_result = analyze_dataframe(combined_data)
# Print the result
print("Data Analysis:")
print(analysis_result)
# Count data types
data_type_counts = count_data_types(combined_data)
# Print the data type counts
print("\nData Type Counts:")
print(data_type_counts)
Data Analysis:
Data Type Missing Values % Missing Unique Values
age int64 0 0.0 91
class_of_worker object 0 0.0 9
industry_code int64 0 0.0 52
occupation_code int64 0 0.0 47
education object 0 0.0 17
wage_per_hour int64 0 0.0 1425
enrolled_in_edu_inst object 0 0.0 3
marital_status object 0 0.0 7
major_industry_code object 0 0.0 24
major_occupation_code object 0 0.0 15
race object 0 0.0 5
hispanic_origin object 0 0.0 10
sex object 0 0.0 2
member_of_labor_union object 0 0.0 3
reason_for_unemployment object 0 0.0 6
full_or_part_time_employment object 0 0.0 8
capital_gains int64 0 0.0 133
capital_losses int64 0 0.0 114
dividends_from_stocks int64 0 0.0 1675
tax_filer_status object 0 0.0 6
region_of_previous_residence object 0 0.0 6
state_of_previous_residence object 0 0.0 51
detailed_household_summary object 0 0.0 38
detailed_household_summary_in_household object 0 0.0 8
instance_weight float64 0 0.0 123232
migration_code_change_in_msa object 0 0.0 10
migration_code_change_in_reg object 0 0.0 9
migration_code_move_within_reg object 0 0.0 10
live_in_this_house_1_year_ago object 0 0.0 3
migration_prev_res_in_sunbelt object 0 0.0 4
num_persons_worked_for_employer int64 0 0.0 7
family_members_under_18 object 0 0.0 5
country_of_birth_father object 0 0.0 43
country_of_birth_mother object 0 0.0 43
country_of_birth_self object 0 0.0 43
citizenship object 0 0.0 5
own_business_or_self_employed int64 0 0.0 3
fill_inc_questionnaire_for_veteran object 0 0.0 3
veterans_benefits int64 0 0.0 3
weeks_worked_in_year int64 0 0.0 53
year int64 0 0.0 2
income object 0 0.0 2
Data Type Counts:
object 29
int64 12
float64 1
Name: count, dtype: int64
Let's check for duplicate rows in our data and drop them consistently across sets:
We need to track dropped rows to ensure consistency between train and test sets.
# Function to track and consistently drop duplicates across datasets
def track_and_drop_duplicates(combined_df, train_df, test_df):
    # Get initial shapes
    print(f"Before: Combined data shape: {combined_df.shape}")
    print(f"Before: Train data shape: {train_df.shape}")
    print(f"Before: Test data shape: {test_df.shape}")
    # Check initial duplicates
    duplicates_count = combined_df.duplicated().sum()
    print(f"Number of duplicate rows in combined dataset: {duplicates_count}")
    # Get the indices of the duplicate rows in the combined dataset
    duplicate_mask = combined_df.duplicated(keep='first')
    duplicate_indices = combined_df[duplicate_mask].index.tolist()
    # Split these indices into train and test
    # (test rows sit after the train rows in the combined frame, so subtract the train size)
    train_size = train_df.shape[0]
    train_duplicate_indices = [idx for idx in duplicate_indices if idx < train_size]
    test_duplicate_indices = [idx - train_size for idx in duplicate_indices if idx >= train_size]
    print(f"Duplicates in train: {len(train_duplicate_indices)}")
    print(f"Duplicates in test: {len(test_duplicate_indices)}")
    # Drop duplicates from all datasets
    combined_df_clean = combined_df.drop_duplicates(keep='first').reset_index(drop=True)
    train_df_clean = train_df.drop(index=train_duplicate_indices, errors='ignore').reset_index(drop=True)
    test_df_clean = test_df.drop(index=test_duplicate_indices, errors='ignore').reset_index(drop=True)
    # Cross-check consistency
    print(f"After: Combined data shape: {combined_df_clean.shape}")
    print(f"After: Train data shape: {train_df_clean.shape}")
    print(f"After: Test data shape: {test_df_clean.shape}")
    print(f"Sum of train + test: {train_df_clean.shape[0] + test_df_clean.shape[0]}")
    return combined_df_clean, train_df_clean, test_df_clean

# Apply the function
combined_data_clean, train_data_clean, test_data_clean = track_and_drop_duplicates(
    combined_data, train_data, test_data
)
Before: Combined data shape: (299285, 42)
Before: Train data shape: (199523, 42)
Before: Test data shape: (99762, 42)
Number of duplicate rows in combined dataset: 6735
Duplicates in train: 3229
Duplicates in test: 3506
After: Combined data shape: (292550, 42)
After: Train data shape: (196294, 42)
After: Test data shape: (96256, 42)
Sum of train + test: 292550
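The index bookkeeping used above can be checked on a small example. The sketch below uses made-up toy frames (`toy_train`, `toy_test` are hypothetical names, not part of the project) and assumes, as in the notebook, that the combined frame was built with `ignore_index=True` so that test rows follow the train rows.

```python
import pandas as pd

# Toy stand-ins for the train and test sets (hypothetical data)
toy_train = pd.DataFrame({"a": [1, 2, 2], "b": ["x", "y", "y"]})
toy_test = pd.DataFrame({"a": [2, 3], "b": ["y", "z"]})
toy_combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# Positions of rows that repeat an earlier row in the combined frame
dup_idx = toy_combined[toy_combined.duplicated(keep="first")].index.tolist()

# Map combined positions back to positions in each source frame
n_train = len(toy_train)
train_dups = [i for i in dup_idx if i < n_train]
test_dups = [i - n_train for i in dup_idx if i >= n_train]

train_kept = toy_train.drop(index=train_dups).reset_index(drop=True)
test_kept = toy_test.drop(index=test_dups).reset_index(drop=True)
combined_kept = toy_combined.drop_duplicates(keep="first").reset_index(drop=True)

print("train duplicates:", train_dups)
print("test duplicates:", test_dups)
print("consistent:", len(train_kept) + len(test_kept) == len(combined_kept))
```

Note that under this scheme a test row that repeats a train row is removed from the test set, because the train copy appears first in the combined frame.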
# Target variable distribution
plt.figure(figsize=(10, 6))
income_counts = combined_data['income'].value_counts()
plt.bar(income_counts.index, income_counts.values, color=['steelblue', 'coral'])
plt.title('Distribution of Income Level', fontsize=15)
plt.xlabel('Income Level', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
# Add percentage labels
total = income_counts.sum()
for i, count in enumerate(income_counts.values):
    percentage = count / total * 100
    plt.text(i, count, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=12)
plt.tight_layout()
plt.show()
# Print the exact numbers - using actual values in the dataset
print(f"Income distribution:\n{income_counts}")
# Check actual values in the dataset
print(f"Unique income values: {combined_data['income'].unique()}")
# Calculate percentages using the actual values in the dataset
for income_level in combined_data['income'].unique():
    percentage = (combined_data['income'] == income_level).mean() * 100
    print(f"Percentage {income_level}: {percentage:.2f}%")
Income distribution:
income
- 50000.    280717
50000+.      18568
Name: count, dtype: int64
Unique income values: [' - 50000.' ' 50000+.']
Percentage - 50000.: 93.80%
Percentage 50000+.: 6.20%
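With a split this uneven, it helps to know what a trivial model would score. The sketch below is only an illustration, not part of the original analysis: it takes the two class counts printed above and computes the accuracy of always predicting the majority class, plus the ratio between the classes.

```python
# Class counts copied from the output above
counts = {" - 50000.": 280717, " 50000+.": 18568}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

# Number of low-income rows per high-income row
imbalance_ratio = counts[" - 50000."] / counts[" 50000+."]

print(f"Majority class: {majority_label!r}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

Any accuracy reported later is best read against this baseline, which is why metrics such as precision, recall, F1 and ROC AUC (all imported earlier) are useful here.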
3.2 Numerical Features Analysis¶
Let's examine the distributions of numerical features and their relationships with income.
# Identify numerical columns based on metadata (7 continuous variables)
numerical_columns = ['age', 'wage_per_hour', 'capital_gains', 'capital_losses',
'dividends_from_stocks', 'num_persons_worked_for_employer', 'weeks_worked_in_year']
# Create histograms for numerical features
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(combined_data[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
# Box plots for numerical features by income
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_columns):
    plt.subplot(3, 3, i+1)
    sns.boxplot(x='income', y=col, data=combined_data)
    plt.title(f'{col} by Income Level')
plt.tight_layout()
plt.show()
# Correlation matrix for numerical features
plt.figure(figsize=(12, 10))
correlation_matrix = combined_data[numerical_columns].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features', fontsize=15)
plt.tight_layout()
plt.show()
3.3 Categorical Features Analysis¶
Now let's examine some key categorical features and their relationships with income.
# Select important categorical features for analysis
categorical_columns = [
'class_of_worker',
'industry_code',
'occupation_code',
'education',
'enrolled_in_edu_inst',
'marital_status',
'major_industry_code',
'major_occupation_code',
'race',
'hispanic_origin',
'sex',
'member_of_labor_union',
'reason_for_unemployment',
'full_or_part_time_employment',
'tax_filer_status',
'region_of_previous_residence',
'state_of_previous_residence',
'detailed_household_summary',
'detailed_household_summary_in_household',
'migration_code_change_in_msa',
'migration_code_change_in_reg',
'migration_code_move_within_reg',
'live_in_this_house_1_year_ago',
'migration_prev_res_in_sunbelt',
'family_members_under_18',
'country_of_birth_father',
'country_of_birth_mother',
'country_of_birth_self',
'citizenship',
'own_business_or_self_employed',
'fill_inc_questionnaire_for_veteran',
'veterans_benefits',
'year'
]
# Create count plots for categorical features
for col in categorical_columns:
    plt.figure(figsize=(12, 6))
    # Get value counts for the feature
    value_counts = combined_data[col].value_counts().nlargest(10)
    # Plot the 10 most common values
    sns.countplot(y=col, data=combined_data, order=value_counts.index)
    plt.title(f'Distribution of {col} (Top 10 Categories)', fontsize=15)
    plt.tight_layout()
    plt.show()
    # Stacked bar chart showing income distribution by category
    fig, ax = plt.subplots(figsize=(14, 8))
    # Prepare data for stacked bar chart
    cross_tab = pd.crosstab(
        combined_data[col],
        combined_data['income'],
        normalize='index'
    ) * 100  # Convert to percentage
    # Plot only the top 10 categories on the axes created above; without ax=ax,
    # pandas opens a new figure and the 14x8 figure is left empty
    cross_tab.loc[value_counts.index].plot(kind='barh', stacked=True,
                                           colormap='coolwarm', ax=ax)
    plt.title(f'Income Distribution by {col} (Top 10 Categories)', fontsize=15)
    plt.xlabel('Percentage', fontsize=12)
    plt.tight_layout()
    plt.show()
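The stacked bars rely on `pd.crosstab(..., normalize='index')`, which turns each row (category) into shares that sum to 1. A minimal sketch on made-up data (the `toy` frame is hypothetical) shows what the percentages mean:

```python
import pandas as pd

# Hypothetical toy data: 2 women and 4 men, one man in the higher income class
toy = pd.DataFrame({
    "sex": [" Female", " Female", " Male", " Male", " Male", " Male"],
    "income": [" - 50000.", " - 50000.", " - 50000.", " 50000+.", " - 50000.", " - 50000."],
})

# normalize='index' divides each row by its row total; * 100 gives percentages
shares = pd.crosstab(toy["sex"], toy["income"], normalize="index") * 100
row_totals = shares.sum(axis=1)

print(shares)
print(row_totals)
```

Each bar therefore shows the income split within one category, regardless of how many rows that category has, so rare categories look as prominent as common ones.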
4. Data Preparation¶
Now that we've explored the data, let's prepare it for modeling.
4.1 Data Cleaning¶
# Function to clean data
def clean_data(df):
    # Make a copy of the dataframe
    df_clean = df.copy()
    # Convert target to binary; the raw labels carry a leading space
    df_clean['income'] = df_clean['income'].map({' - 50000.': 0, ' 50000+.': 1})
    # Replace '?' values with np.nan. This is an exact match, so values stored
    # with a leading space (' ?') are left unchanged.
    df_clean = df_clean.replace('?', np.nan)
    return df_clean
# Clean the training and test data
train_data_clean = clean_data(train_data_clean)
test_data_clean = clean_data(test_data_clean)
combined_data_clean = clean_data(combined_data_clean)
# Check missing values after cleaning
train_missing = train_data_clean.isnull().sum()[train_data_clean.isnull().sum() > 0]
print("Missing values in training data:")
print(train_missing)
Missing values in training data: Series([], dtype: int64)
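The empty result is worth a second look: the unique-value listing in the next section shows entries such as `' ?'` with a leading space, which an exact match on `'?'` does not catch. The sketch below, on made-up data, illustrates the difference. Whether `?` should count as missing or stay as its own category is a modeling choice; this is one possible approach, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the leading-space formatting of the raw file
toy = pd.DataFrame({
    "country_of_birth_self": [" United-States", " ?", " Vietnam"],
    "migration_prev_res_in_sunbelt": [" ?", " Yes", " Not in universe"],
})

# Exact match on '?' finds nothing, because the stored value is ' ?'
exact = toy.replace("?", np.nan)
n_missing_exact = int(exact.isnull().sum().sum())

# Stripping surrounding whitespace first lets the same replace work
stripped = toy.apply(lambda s: s.str.strip()).replace("?", np.nan)
n_missing_stripped = int(stripped.isnull().sum().sum())

print(n_missing_exact, n_missing_stripped)
```

If whitespace were stripped in the real data, any mapping keyed on the raw strings (such as the income labels) would need the stripped spellings as well.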
4.2 Identify Categorical Features and Their Unique Values¶
categorical_columns = [
'class_of_worker', 'industry_code', 'occupation_code', 'education',
'enrolled_in_edu_inst', 'marital_status', 'major_industry_code',
'major_occupation_code', 'race', 'hispanic_origin', 'sex',
'member_of_labor_union', 'reason_for_unemployment',
'full_or_part_time_employment', 'tax_filer_status',
'region_of_previous_residence', 'state_of_previous_residence',
'detailed_household_summary', 'detailed_household_summary_in_household',
'migration_code_change_in_msa', 'migration_code_change_in_reg',
'migration_code_move_within_reg', 'live_in_this_house_1_year_ago',
'migration_prev_res_in_sunbelt', 'family_members_under_18',
'country_of_birth_father', 'country_of_birth_mother',
'country_of_birth_self', 'citizenship', 'own_business_or_self_employed',
'fill_inc_questionnaire_for_veteran', 'veterans_benefits', 'year'
]
# Function to identify categorical features and their unique values
def identify_categorical_features(df):
    # Report how many categorical columns are listed above
    print(f"Identified {len(categorical_columns)} categorical features")
    # Analyze unique values for each categorical column
    for col in categorical_columns:
        n_unique = df[col].nunique()
        unique_values = df[col].unique()
        print(f"{col}: {n_unique} unique values")
        print(f"Unique values: {unique_values}")
        print("-" * 50)
    return categorical_columns
# Apply the function to see unique values
categorical_columns = identify_categorical_features(combined_data_clean)
Identified 33 categorical features
class_of_worker: 9 unique values
Unique values: [' Not in universe' ' Self-employed-not incorporated' ' Private' ' Local government' ' Federal government' ' Self-employed-incorporated' ' State government' ' Never worked' ' Without pay']
--------------------------------------------------
industry_code: 52 unique values
Unique values: [ 0 4 40 34 43 37 24 39 12 35 45 3 19 29 32 48 33 23 44 36 31 30 41 5 11 9 42 6 18 50 2 1 26 47 16 14 22 17 7 8 25 46 27 15 13 49 38 21 28 20 51 10]
--------------------------------------------------
occupation_code: 47 unique values
Unique values: [ 0 34 10 3 40 26 37 31 12 36 41 22 2 35 25 23 42 8 19 29 27 16 33 13 18 9 17 39 32 11 30 38 20 7 21 44 24 43 28 4 1 6 45 14 5 15 46]
--------------------------------------------------
education: 17 unique values
Unique values: [' High school graduate' ' Some college but no degree' ' 10th grade' ' Children' ' Bachelors degree(BA AB BS)' ' Masters degree(MA MS MEng MEd MSW MBA)' ' Less than 1st grade' ' Associates degree-academic program' ' 7th and 8th grade' ' 12th grade no diploma' ' Associates degree-occup /vocational' ' Prof school degree (MD DDS DVM LLB JD)' ' 5th or 6th grade' ' 11th grade' ' Doctorate degree(PhD EdD)' ' 9th grade' ' 1st 2nd 3rd or 4th grade']
--------------------------------------------------
enrolled_in_edu_inst: 3 unique values
Unique values: [' Not in universe' ' High school' ' College or university']
--------------------------------------------------
marital_status: 7 unique values
Unique values: [' Widowed' ' Divorced' ' Never married' ' Married-civilian spouse present' ' Separated' ' Married-spouse absent' ' Married-A F spouse present']
--------------------------------------------------
major_industry_code: 24 unique values
Unique values: [' Not in universe or children' ' Construction' ' Entertainment' ' Finance insurance and real estate' ' Education' ' Business and repair services' ' Manufacturing-nondurable goods' ' Personal services except private HH' ' Manufacturing-durable goods' ' Other professional services' ' Mining' ' Transportation' ' Wholesale trade' ' Public administration' ' Retail trade' ' Social services' ' Private household services' ' Utilities and sanitary services' ' Communications' ' Hospital services' ' Medical except hospital' ' Agriculture' ' Forestry and fisheries' ' Armed Forces']
--------------------------------------------------
major_occupation_code: 15 unique values
Unique values: [' Not in universe' ' Precision production craft & repair' ' Professional specialty' ' Executive admin and managerial' ' Handlers equip cleaners etc ' ' Adm support including clerical' ' Machine operators assmblrs & inspctrs' ' Other service' ' Sales' ' Private household services' ' Technicians and related support' ' Transportation and material moving' ' Farming forestry and fishing' ' Protective services' ' Armed Forces']
--------------------------------------------------
race: 5 unique values
Unique values: [' White' ' Asian or Pacific Islander' ' Amer Indian Aleut or Eskimo' ' Black' ' Other']
--------------------------------------------------
hispanic_origin: 10 unique values
Unique values: [' All other' ' Do not know' ' Central or South American' ' Mexican (Mexicano)' ' Mexican-American' ' Other Spanish' ' Puerto Rican' ' Cuban' ' Chicano' ' NA']
--------------------------------------------------
sex: 2 unique values
Unique values: [' Female' ' Male']
--------------------------------------------------
member_of_labor_union: 3 unique values
Unique values: [' Not in universe' ' No' ' Yes']
--------------------------------------------------
reason_for_unemployment: 6 unique values
Unique values: [' Not in universe' ' Job loser - on layoff' ' Other job loser' ' New entrant' ' Re-entrant' ' Job leaver']
--------------------------------------------------
full_or_part_time_employment: 8 unique values
Unique values: [' Not in labor force' ' Children or Armed Forces' ' Full-time schedules' ' Unemployed full-time' ' Unemployed part- time' ' PT for non-econ reasons usually FT' ' PT for econ reasons usually PT' ' PT for econ reasons usually FT']
--------------------------------------------------
tax_filer_status: 6 unique values
Unique values: [' Nonfiler' ' Head of household' ' Joint both under 65' ' Single' ' Joint both 65+' ' Joint one under 65 & one 65+']
--------------------------------------------------
region_of_previous_residence: 6 unique values
Unique values: [' Not in universe' ' South' ' Northeast' ' Midwest' ' West' ' Abroad']
--------------------------------------------------
state_of_previous_residence: 51 unique values
Unique values: [' Not in universe' ' Arkansas' ' Utah' ' Michigan' ' Minnesota' ' Alaska' ' Kansas' ' Indiana' ' ?' ' Massachusetts' ' New Mexico' ' Nevada' ' Tennessee' ' Colorado' ' Abroad' ' Kentucky' ' California' ' Arizona' ' North Carolina' ' Connecticut' ' Florida' ' Vermont' ' Maryland' ' Oklahoma' ' Oregon' ' Ohio' ' South Carolina' ' Texas' ' Montana' ' Wyoming' ' Georgia' ' Pennsylvania' ' Iowa' ' New Hampshire' ' Missouri' ' Alabama' ' North Dakota' ' New Jersey' ' Louisiana' ' West Virginia' ' Delaware' ' Illinois' ' Maine' ' Wisconsin' ' New York' ' Idaho' ' District of Columbia' ' South Dakota' ' Nebraska' ' Virginia' ' Mississippi']
--------------------------------------------------
detailed_household_summary: 38 unique values
Unique values: [' Other Rel 18+ ever marr not in subfamily' ' Householder' ' Child 18+ never marr Not in a subfamily' ' Child <18 never marr not in subfamily' ' Spouse of householder' ' Secondary individual' ' Other Rel 18+ never marr not in subfamily' ' Nonfamily householder' ' Grandchild <18 never marr not in subfamily' ' Grandchild <18 never marr child of subfamily RP' ' Child 18+ ever marr Not in a subfamily' ' Child 18+ never marr RP of subfamily' ' Child 18+ spouse of subfamily RP' ' Other Rel <18 never marr child of subfamily RP' ' Child under 18 of RP of unrel subfamily' ' Grandchild 18+ never marr not in subfamily' ' Child 18+ ever marr RP of subfamily' ' Other Rel 18+ ever marr RP of subfamily' ' RP of unrelated subfamily' ' Other Rel 18+ spouse of subfamily RP' ' Other Rel <18 never marr not in subfamily' ' Other Rel <18 spouse of subfamily RP' ' In group quarters' ' Grandchild 18+ spouse of subfamily RP' ' Other Rel 18+ never marr RP of subfamily' ' Child <18 never marr RP of subfamily' ' Child <18 ever marr not in subfamily' ' Other Rel <18 ever marr RP of subfamily' ' Grandchild 18+ ever marr not in subfamily' ' Child <18 spouse of subfamily RP' ' Spouse of RP of unrelated subfamily' ' Other Rel <18 never married RP of subfamily' ' Grandchild 18+ never marr RP of subfamily' ' Grandchild 18+ ever marr RP of subfamily' ' Child <18 ever marr RP of subfamily' ' Other Rel <18 ever marr not in subfamily' ' Grandchild <18 never marr RP of subfamily' ' Grandchild <18 ever marr not in subfamily']
--------------------------------------------------
detailed_household_summary_in_household: 8 unique values
Unique values: [' Other relative of householder' ' Householder' ' Child 18 or older' ' Child under 18 never married' ' Spouse of householder' ' Nonrelative of householder' ' Group Quarters- Secondary individual' ' Child under 18 ever married']
--------------------------------------------------
migration_code_change_in_msa: 10 unique values
Unique values: [' ?' ' MSA to MSA' ' Nonmover' ' NonMSA to nonMSA' ' Not in universe' ' Not identifiable' ' Abroad to MSA' ' MSA to nonMSA' ' Abroad to nonMSA' ' NonMSA to MSA']
--------------------------------------------------
migration_code_change_in_reg: 9 unique values
Unique values: [' ?' ' Same county' ' Nonmover' ' Different region' ' Different county same state' ' Not in universe' ' Different division same region' ' Abroad' ' Different state same division']
--------------------------------------------------
migration_code_move_within_reg: 10 unique values
Unique values: [' ?' ' Same county' ' Nonmover' ' Different state in South' ' Different county same state' ' Not in universe' ' Different state in Northeast' ' Abroad' ' Different state in Midwest' ' Different state in West']
--------------------------------------------------
live_in_this_house_1_year_ago: 3 unique values
Unique values: [' Not in universe under 1 year old' ' No' ' Yes']
--------------------------------------------------
migration_prev_res_in_sunbelt: 4 unique values
Unique values: [' ?' ' Yes' ' Not in universe' ' No']
--------------------------------------------------
family_members_under_18: 5 unique values
Unique values: [' Not in universe' ' Both parents present' ' Mother only present' ' Neither parent present' ' Father only present']
--------------------------------------------------
country_of_birth_father: 43 unique values
Unique values: [' United-States' ' Vietnam' ' Philippines' ' ?' ' Columbia' ' Germany' ' Mexico' ' Japan' ' Peru' ' Dominican-Republic' ' South Korea' ' Cuba' ' El-Salvador' ' Canada' ' Scotland' ' Outlying-U S (Guam USVI etc)' ' Italy' ' Guatemala' ' Ecuador' ' Puerto-Rico' ' Cambodia' ' China' ' Poland' ' Nicaragua' ' Taiwan' ' England' ' Ireland' ' Hungary' ' Yugoslavia' ' Trinadad&Tobago' ' Jamaica' ' Honduras' ' Portugal' ' Iran' ' France' ' India' ' Hong Kong' ' Haiti' ' Greece' ' Holand-Netherlands' ' Thailand' ' Laos' ' Panama']
--------------------------------------------------
country_of_birth_mother: 43 unique values
Unique values: [' United-States' ' Vietnam' ' ?' ' Columbia' ' Mexico' ' El-Salvador' ' Peru' ' Puerto-Rico' ' Cuba' ' Philippines' ' Dominican-Republic' ' Germany' ' England' ' Guatemala' ' Scotland' ' Portugal' ' Italy' ' Ecuador' ' Yugoslavia' ' China' ' Poland' ' Hungary' ' Nicaragua' ' Taiwan' ' Ireland' ' Canada' ' South Korea' ' Trinadad&Tobago' ' Jamaica' ' Honduras' ' Iran' ' France' ' Cambodia' ' India' ' Hong Kong' ' Haiti' ' Japan' ' Greece' ' Holand-Netherlands' ' Thailand' ' Panama' ' Laos' ' Outlying-U S (Guam USVI etc)']
--------------------------------------------------
country_of_birth_self: 43 unique values
Unique values: [' United-States' ' Vietnam' ' ?' ' Columbia' ' Mexico' ' Peru' ' Cuba' ' Philippines' ' Dominican-Republic' ' El-Salvador' ' Canada' ' Scotland' ' Portugal' ' Guatemala' ' Ecuador' ' Germany' ' Outlying-U S (Guam USVI etc)' ' Puerto-Rico' ' Italy' ' China' ' Poland' ' Nicaragua' ' Taiwan' ' England' ' Ireland' ' South Korea' ' Trinadad&Tobago' ' Jamaica' ' Honduras' ' Iran' ' Hungary' ' France' ' Cambodia' ' India' ' Hong Kong' ' Japan' ' Haiti' ' Holand-Netherlands' ' Greece' ' Thailand' ' Panama' ' Yugoslavia' ' Laos']
--------------------------------------------------
citizenship: 5 unique values
Unique values: [' Native- Born in the United States' ' Foreign born- Not a citizen of U S ' ' Foreign born- U S citizen by naturalization' ' Native- Born abroad of American Parent(s)' ' Native- Born in Puerto Rico or U S Outlying']
--------------------------------------------------
own_business_or_self_employed: 3 unique values
Unique values: [0 2 1]
--------------------------------------------------
fill_inc_questionnaire_for_veteran: 3 unique values
Unique values: [' Not in universe' ' No' ' Yes']
--------------------------------------------------
veterans_benefits: 3 unique values
Unique values: [2 0 1]
--------------------------------------------------
year: 2 unique values
Unique values: [95 94]
--------------------------------------------------
4.3 Feature Engineering¶
# Feature engineering function (categorical encoding is handled separately in 4.4)
def engineer_features(df):
    df = df.copy()
    # Create work experience feature (assuming people start working at age 18);
    # negative values are clipped to 0
    df['work_experience'] = (df['age'] - 18).clip(lower=0)
    # Create a feature for capital gains/losses ratio
    df['capital_ratio'] = df['capital_gains'] / (df['capital_losses'] + 1)  # Adding 1 to avoid division by zero
    # Binary feature for full year worker
    df['full_year_worker'] = (df['weeks_worked_in_year'] >= 50).astype(int)
    # Create binary features for capital gains/losses and dividends.
    # Note: has_capital_gains flags gains above 7000, not merely non-zero gains.
    df['has_capital_gains'] = (df['capital_gains'] > 7000).astype(int)
    df['has_capital_losses'] = (df['capital_losses'] > 0).astype(int)
    df['has_dividends'] = (df['dividends_from_stocks'] > 0).astype(int)
    # Marital status simplified. Use a case-sensitive prefix match:
    # a case-insensitive search for 'Married' would also match ' Never married'.
    df['is_married'] = df['marital_status'].str.strip().str.startswith('Married').astype(int)
    return df

# Apply feature engineering
train_data_fe = engineer_features(train_data_clean)
test_data_fe = engineer_features(test_data_clean)
combined_data_fe = engineer_features(combined_data_clean)
# Check the new features
print("Engineered features added. New dataframe shape:", train_data_fe.shape)
Engineered features added. New dataframe shape: (196294, 49)
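A caveat when deriving `is_married` by substring matching: a case-insensitive search for 'Married' also matches labels such as ' Never married'. The sketch below illustrates the difference on a small hypothetical sample. The label strings are assumed to follow the census coding (leading space included), and `sample`, `loose`, and `strict` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample of marital_status labels (assumed to mirror the census coding).
sample = pd.Series([
    ' Married-civilian spouse present',
    ' Never married',
    ' Divorced',
    ' Married-spouse absent',
])

# Case-insensitive substring match: also flags ' Never married'.
loose = sample.str.contains('Married', case=False).astype(int)

# Case-sensitive prefix match after stripping whitespace:
# flags only labels that start with 'Married'.
strict = sample.str.strip().str.startswith('Married').astype(int)

print(pd.DataFrame({'label': sample, 'loose': loose, 'strict': strict}))
```

Comparing the mean of such a flag against the known share of married respondents is a quick sanity check for this kind of feature.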
4.4 Encode Categorical Variables¶
def encode_categorical(df):
    df = df.copy()
    label_encoded_columns = []  # Track columns that were label encoded

    # 1. Label Encoding
    def label_encode(df):
        # Define mappings
        education_mapping = {
            ' Less than 1st grade': 0, ' 1st 2nd 3rd or 4th grade': 1,
            ' 5th or 6th grade': 2, ' 7th and 8th grade': 3,
            ' 9th grade': 4, ' 10th grade': 5, ' 11th grade': 6,
            ' 12th grade no diploma': 7, ' High school graduate': 8,
            ' Some college but no degree': 9, ' Associates degree-occup /vocational': 10,
            ' Associates degree-academic program': 11, ' Bachelors degree(BA AB BS)': 12,
            ' Masters degree(MA MS MEng MEd MSW MBA)': 13,
            ' Prof school degree (MD DDS DVM LLB JD)': 14,
            ' Doctorate degree(PhD EdD)': 15, ' Children': -1}
        # Company size categorical feature.
        # NOTE: none of these keys match the values found in
        # 'num_persons_worked_for_employer', so the encoded column comes out
        # entirely NaN and is removed again in 4.4.2.
        company_size_map = {
            'Not in universe': 0,
            'under 10': 1,
            '10 - 24': 2,
            '25 - 99': 3,
            '100 - 499': 4,
            '500 - 999': 5,
            '1000+': 6
        }
        simple_mapping = {
            ' Not in universe': 0, ' No': 1, ' Yes': 2
        }
        enrolled_mapping = {
            ' Not in universe': 0, ' High school': 1, ' College or university': 2}
        live_in_house_mapping = {
            ' Not in universe under 1 year old': 0, ' No': 1, ' Yes': 2}
        # Apply mappings
        label_encode_cols = {
            'sex': {' Female': 0, ' Male': 1},
            'education': education_mapping,
            'enrolled_in_edu_inst': enrolled_mapping,
            'member_of_labor_union': simple_mapping,
            'live_in_this_house_1_year_ago': live_in_house_mapping,
            'fill_inc_questionnaire_for_veteran': simple_mapping,
            'num_persons_worked_for_employer': company_size_map
        }
        # Create new columns with encoded values
        encoded_columns = []
        for col, mapping in label_encode_cols.items():
            if col in df.columns:
                new_col_name = f"{col}_encoded"
                df[new_col_name] = df[col].map(mapping)
                encoded_columns.append(col)
        return df, encoded_columns, list(label_encode_cols.keys())

    # Apply label encoding and get list of columns that were encoded
    df, label_encoded_columns, label_encode_keys = label_encode(df)

    # 2. One-Hot Encoding
    # Identify categorical columns, excluding 'income'.
    # The original label-encoded columns are still of dtype object at this
    # point, so they are one-hot encoded here as well.
    categorical_cols = [col for col in df.columns if
                        df[col].dtype == 'object' and col != 'income' and
                        not col.endswith('_encoded')]
    # One-hot encode remaining categorical columns.
    # The encoder is fit on whichever DataFrame is passed in, so train and test
    # can end up with different dummy columns (handled in 4.4.1).
    if categorical_cols:
        encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        encoded = encoder.fit_transform(df[categorical_cols])
        encoded_df = pd.DataFrame(
            encoded,
            columns=encoder.get_feature_names_out(categorical_cols),
            index=df.index
        )
        df = pd.concat([df, encoded_df], axis=1)

    # 3. Numeric Conversion
    numeric_cols = ['occupation_code', 'industry_code', 'year', 'veterans_benefits', 'own_business_or_self_employed']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

    # 4. Drop original categorical columns
    cols_to_drop = categorical_cols + label_encode_keys
    df = df.drop(columns=cols_to_drop, errors='ignore')
    return df

# Encode all datasets
train_data_enc = encode_categorical(train_data_fe)
test_data_enc = encode_categorical(test_data_fe)
combined_data_enc = encode_categorical(combined_data_fe)
print("Engineered features added. Train dataframe shape:", train_data_enc.shape)
print("Engineered features added. Test dataframe shape:", test_data_enc.shape)
print("Engineered features added. Combined dataframe shape:", combined_data_enc.shape)
Engineered features added. Train dataframe shape: (196294, 423)
Engineered features added. Test dataframe shape: (96256, 422)
Engineered features added. Combined dataframe shape: (292550, 423)
4.4.1 Drop new columns not common in both train and test¶
def align_columns(train_df, test_df):
    """
    Ensures that the training and test DataFrames have the same columns
    by dropping any column that appears in only one of them.
    Args:
        train_df (pd.DataFrame): The training DataFrame.
        test_df (pd.DataFrame): The test DataFrame.
    Returns:
        tuple: The modified training and test DataFrames with aligned columns.
    """
    train_cols = set(train_df.columns)
    test_cols = set(test_df.columns)
    # Find columns unique to train (sorted for reproducible output)
    train_unique = sorted(train_cols - test_cols)
    # Find columns unique to test
    test_unique = sorted(test_cols - train_cols)
    # Print and drop columns that exist only in train
    if train_unique:
        print("Columns unique to training data:")
        for col in train_unique:
            print(f" - {col}")
        train_df = train_df.drop(columns=train_unique, errors='ignore')
    # Print and drop columns that exist only in test
    if test_unique:
        print("\nColumns unique to test data:")
        for col in test_unique:
            print(f" - {col}")
        test_df = test_df.drop(columns=test_unique, errors='ignore')
    return train_df, test_df

# Align columns between training and test sets
train_data_aligned, test_data_aligned = align_columns(train_data_enc, test_data_enc)
# Verify the shapes after alignment
print("\nTraining data shape after alignment:", train_data_aligned.shape)
print("Test data shape after alignment:", test_data_aligned.shape)
Columns unique to training data:
 - detailed_household_summary_ Grandchild <18 ever marr not in subfamily

Training data shape after alignment: (196294, 422)
Test data shape after alignment: (96256, 422)
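Dropping mismatched columns works, but an alternative is to derive the dummy columns from the training data and reindex the test data to match, so that a category seen in only one split does not change the feature set. The following is a minimal sketch of that idea using `pandas.get_dummies` and `DataFrame.reindex`; the toy frames and the column name `household` are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy data: category 'C' appears only in train, 'D' only in test.
train = pd.DataFrame({'household': ['A', 'B', 'C'], 'age': [30, 40, 50]})
test = pd.DataFrame({'household': ['A', 'B', 'D'], 'age': [25, 35, 45]})

# One-hot encode each split.
train_enc = pd.get_dummies(train, columns=['household'], dtype=int)
test_enc = pd.get_dummies(test, columns=['household'], dtype=int)

# Force the test columns to match train: columns missing in test are added
# and filled with 0, columns unknown to train are discarded.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

With scikit-learn, a comparable effect can be obtained by fitting a single `OneHotEncoder(handle_unknown='ignore')` on the training data and calling `transform` on the other splits.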
4.4.2 Drop Features with High Null Percentage¶
def drop_high_null_features(train_df, test_df, threshold=0.3):
    """
    Removes features that have a percentage of null values above the specified threshold.
    Args:
        train_df (pd.DataFrame): Training dataframe
        test_df (pd.DataFrame): Test dataframe
        threshold (float): Maximum allowed fraction of nulls (0.0 to 1.0)
    Returns:
        tuple: Clean training and test dataframes with high-null features removed
    """
    # Make copies to avoid modifying the originals
    train_df = train_df.copy()
    test_df = test_df.copy()
    # Calculate null fractions for each column in training data
    null_percentages = train_df.isnull().mean()
    # Identify columns to drop (excluding 'income')
    high_null_cols = null_percentages[
        (null_percentages > threshold) &
        (null_percentages.index != 'income')
    ].index.tolist()
    # Log the columns being dropped
    if high_null_cols:
        print(f"\nRemoving {len(high_null_cols)} features with >{threshold*100:.1f}% null values:")
        for col in high_null_cols:
            print(f" - {col}: {null_percentages[col]*100:.2f}% nulls")
        # Drop the identified columns from both datasets
        train_df = train_df.drop(columns=high_null_cols)
        test_df = test_df.drop(columns=high_null_cols)
        print("\nAfter dropping high-null features:")
        print(f" Training shape: {train_df.shape}")
        print(f" Test shape: {test_df.shape}")
    else:
        print(f"\nNo features exceeded the {threshold*100:.1f}% null threshold")
    return train_df, test_df

# Apply the function to remove high-null features
train_data_nulls_dropped, test_data_nulls_dropped = drop_high_null_features(
    train_data_aligned, test_data_aligned, threshold=0.3
)
# Continue with the clean datasets
# (Rename the variables to maintain naming consistency with subsequent steps)
train_data_aligned = train_data_nulls_dropped
test_data_aligned = test_data_nulls_dropped
Removing 1 features with >30.0% null values:
 - num_persons_worked_for_employer_encoded: 100.00% nulls

After dropping high-null features:
 Training shape: (196294, 421)
 Test shape: (96256, 421)
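A column that is 100% null right after label encoding usually means the mapping keys did not match any value in the column, because `Series.map` returns NaN for every unmatched entry. A small check run before encoding would surface this early. The snippet is only a sketch: `unmapped_values` is a hypothetical helper, and the numeric toy column is an assumption about how such a mismatch can arise, not a statement about the actual data.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Return the distinct non-null values in `series` that have no key in `mapping`."""
    return [v for v in series.dropna().unique() if v not in mapping]

# Mapping keyed by text labels (shortened for illustration).
company_size_map = {'Not in universe': 0, 'under 10': 1, '10 - 24': 2}

# Hypothetical column that already holds numeric codes instead of labels.
col = pd.Series([0, 1, 2, 1, 0])

encoded = col.map(company_size_map)
print("Share of nulls after mapping:", encoded.isnull().mean())
print("Values without a mapping key:", unmapped_values(col, company_size_map))
```

If the helper returns a non-empty list, the mapping (or the decision to encode the column at all) should be revisited before the column is silently discarded by a null filter.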
4.5 Split Data¶
# Limit dataset size if needed
# train_data_aligned = train_data_aligned.iloc[:50000]
# test_data_aligned = test_data_aligned.iloc[:50000]
# Separate features and target for the train and test sets
X_train = train_data_aligned.drop('income', axis=1)
y_train = train_data_aligned['income']
X_test = test_data_aligned.drop('income', axis=1)
y_test = test_data_aligned['income']
# Create validation set with stratification (20% of the training data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                                  random_state=42, stratify=y_train)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
(157035, 420) (157035,) (39259, 420) (39259,)
4.6 Handle Missing Values¶
# Identify numeric and categorical features
num_features = X_train.select_dtypes(include=np.number).columns.tolist()
cat_features = X_train.select_dtypes(include='object').columns.tolist()
# Imputation (imputers are fit on the training split only)
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
# Apply imputation to numeric features
if num_features:
    X_train[num_features] = num_imputer.fit_transform(X_train[num_features])
    X_val[num_features] = num_imputer.transform(X_val[num_features])
    X_test[num_features] = num_imputer.transform(X_test[num_features])
# Apply imputation to categorical features
if cat_features:
    X_train[cat_features] = cat_imputer.fit_transform(X_train[cat_features])
    X_val[cat_features] = cat_imputer.transform(X_val[cat_features])
    X_test[cat_features] = cat_imputer.transform(X_test[cat_features])
# Verify imputation was successful
train_missing = X_train.isnull().sum().sum()
val_missing = X_val.isnull().sum().sum()
test_missing = X_test.isnull().sum().sum()
print("Missing values after imputation:")
print(f"Training data: {train_missing}")
print(f"Validation data: {val_missing}")
print(f"Test data: {test_missing}")
# If any missing values remain, apply a second pass of imputation
if train_missing > 0 or val_missing > 0 or test_missing > 0:
    print("Warning: Some values couldn't be imputed, applying a fallback strategy")
    # Apply a more aggressive fallback imputation strategy
    fallback_imputer = SimpleImputer(strategy='constant', fill_value=0)
    # Get columns that still have missing values
    train_missing_cols = X_train.columns[X_train.isnull().any()].tolist()
    val_missing_cols = X_val.columns[X_val.isnull().any()].tolist()
    test_missing_cols = X_test.columns[X_test.isnull().any()].tolist()
    all_missing_cols = list(set(train_missing_cols + val_missing_cols + test_missing_cols))
    if all_missing_cols:
        X_train[all_missing_cols] = fallback_imputer.fit_transform(X_train[all_missing_cols])
        X_val[all_missing_cols] = fallback_imputer.transform(X_val[all_missing_cols])
        X_test[all_missing_cols] = fallback_imputer.transform(X_test[all_missing_cols])
    # Check again
    print("Missing values after fallback imputation:")
    print(f"Training data: {X_train.isnull().sum().sum()}")
    print(f"Validation data: {X_val.isnull().sum().sum()}")
    print(f"Test data: {X_test.isnull().sum().sum()}")
Missing values after imputation:
Training data: 0
Validation data: 0
Test data: 0
4.7 Handle Outliers¶
def handle_outliers(df, y=None, dataset_name=""):
    # Track initial shape
    initial_shape = df.shape
    print(f"\nProcessing {dataset_name} dataset:")
    print(f"Initial shape: {initial_shape}")
    # Create copy and process
    df_out = df.copy()
    numeric_cols = df_out.select_dtypes(include=np.number).columns.tolist()
    # Only consider a row an outlier if it has outliers in multiple columns
    outlier_count_per_row = pd.Series(0, index=df_out.index)
    # Count outliers per column
    column_outlier_counts = {}
    for col in numeric_cols:
        # Skip columns with low variance or mostly identical values
        if df_out[col].std() < 0.001 or df_out[col].nunique() < 5:
            continue
        # Percentile-based outlier detection. Q1 and Q3 here are the 10th and
        # 90th percentiles rather than quartiles, so "IQR" is the 10-90 range.
        Q1 = df_out[col].quantile(0.10)
        Q3 = df_out[col].quantile(0.90)
        IQR = Q3 - Q1
        # Very conservative threshold - only mark extreme outliers
        lower = Q1 - 5*IQR
        upper = Q3 + 5*IQR
        # Identify outliers in this column
        col_outliers = (df_out[col] < lower) | (df_out[col] > upper)
        column_outlier_counts[col] = col_outliers.sum()
        # Increment outlier count for affected rows
        outlier_count_per_row += col_outliers
    # Only remove rows that are outliers in at least 3 different columns
    # This focuses on truly problematic data points
    outlier_rows = outlier_count_per_row >= 3
    # Get indices of rows to keep
    keep_indices = df_out.index[~outlier_rows]
    # Drop the identified outlier rows
    df_out = df_out.loc[keep_indices]
    # If labels are provided, filter them too
    if y is not None:
        y_out = y.loc[keep_indices]
    else:
        y_out = None
    # Track final shape
    final_shape = df_out.shape
    print(f"Final shape: {final_shape}")
    print(f"Rows maintained: {final_shape[0]} ({(final_shape[0]/initial_shape[0])*100:.1f}%)")
    print(f"Outlier rows removed: {initial_shape[0] - final_shape[0]}")
    return df_out, y_out

# Apply to datasets with names and get updated labels.
# Note: the bounds are computed separately for each split.
X_train_out, y_train_out = handle_outliers(X_train, y_train, "Training")
X_val_out, y_val_out = handle_outliers(X_val, y_val, "Validation")
X_test_out, y_test_out = handle_outliers(X_test, y_test, "Test")
Processing Training dataset:
Initial shape: (157035, 420)
Final shape: (155305, 420)
Rows maintained: 155305 (98.9%)
Outlier rows removed: 1730

Processing Validation dataset:
Initial shape: (39259, 420)
Final shape: (38829, 420)
Rows maintained: 38829 (98.9%)
Outlier rows removed: 430

Processing Test dataset:
Initial shape: (96256, 420)
Final shape: (95180, 420)
Rows maintained: 95180 (98.9%)
Outlier rows removed: 1076
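For reference, the bounds used by `handle_outliers` are much wider than the textbook Tukey rule: they are built from the 10th and 90th percentiles with a multiplier of 5, instead of the quartiles with a multiplier of 1.5. The sketch below compares the two on synthetic data. The data and the helpers `fences` and `count_outside` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def fences(values, low_q, high_q, k):
    """Lower/upper bounds: quantiles widened by k times their spread."""
    lo, hi = np.quantile(values, [low_q, high_q])
    spread = hi - lo
    return lo - k * spread, hi + k * spread

def count_outside(values, bounds):
    """Number of values falling outside the (lower, upper) bounds."""
    lower, upper = bounds
    return int(((values < lower) | (values > upper)).sum())

# Synthetic column: 1..100 plus a moderate and an extreme value.
values = np.append(np.arange(1, 101, dtype=float), [300.0, 1000.0])

tukey = fences(values, 0.25, 0.75, 1.5)   # classic quartile-based rule
wide = fences(values, 0.10, 0.90, 5.0)    # percentiles and multiplier used in this section

print("Tukey bounds:", tukey, "flagged:", count_outside(values, tukey))
print("Wide bounds: ", wide, "flagged:", count_outside(values, wide))
```

On this toy column the classic rule flags both added values, while the wider bounds flag only the extreme one, which is consistent with the small share of rows removed above.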
4.8 Handle Skewness¶
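One point helps when reading skewness reports for one-hot or other 0/1 columns: skewness is unchanged by any increasing affine map, and a log transform of a two-valued column is exactly such a map, since it only relocates the two values. A log transform therefore cannot reduce the skew of a binary indicator, and any tiny difference in the computed value is floating-point noise. The sketch below checks this with a hand-written, moment-based skewness. The helper `skewness` is illustrative and uses the population formula, which differs from `pandas.Series.skew` by a sample-size correction.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical rare indicator: 5% ones, 95% zeros.
indicator = [1] * 5 + [0] * 95

# Shift by |min| + 1 = 1 and then apply log1p, i.e. log(x + 2).
shifted_log = [math.log1p(x + 1) for x in indicator]

before = skewness(indicator)
after = skewness(shifted_log)
print(f"before: {before:.6f}, after: {after:.6f}")
```

Restricting log transforms to continuous features (for example, those with more than two distinct values) avoids spending work on columns whose skew cannot change.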
def handle_skewness(df, threshold=0.5):
    """Handle skewness with detailed before/after reporting"""
    numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
    skewed_features = {}
    transformation_report = []
    # Initial skewness analysis
    print(f"\nSkewness analysis (threshold: {threshold}):")
    for col in numeric_cols:
        skew = df[col].skew()
        if abs(skew) > threshold:
            skewed_features[col] = {
                'initial': skew,
                'final': None,
                'transformed': False
            }
    print(f"Found {len(skewed_features)} potentially skewed features")
    # Apply transformations
    df_transformed = df.copy()
    for col, stats in skewed_features.items():
        original_skew = stats['initial']
        # Check if transformation needed
        if abs(original_skew) > threshold:
            # Handle non-positive values by shifting before the log transform
            if df[col].min() <= 0:
                shift = abs(df[col].min()) + 1
                transformed = np.log1p(df[col] + shift)
            else:
                transformed = np.log1p(df[col])
            # Calculate new skewness
            new_skew = transformed.skew()
            # Only apply if improvement occurs
            if abs(new_skew) < abs(original_skew):
                df_transformed[col] = transformed
                stats['final'] = new_skew
                stats['transformed'] = True
                transformation_report.append(
                    f"{col}: {original_skew:.2f} → {new_skew:.2f} (Improved)"
                )
            else:
                # Keep the original column when |skew| is not reduced
                transformation_report.append(
                    f"{col}: {original_skew:.2f} → {new_skew:.2f} (No change)"
                )
                stats['final'] = original_skew
                stats['transformed'] = False
    # Print transformation results
    print("\nSkewness transformation results:")
    for report in transformation_report:
        print(report)
    return df_transformed, skewed_features

# Apply to processed datasets.
# Note: each call decides per dataset which columns to transform, so the three
# splits are not guaranteed to receive identical transformations.
print("\n=== Training Data ===")
X_train_skew, train_skew_info = handle_skewness(X_train_out, threshold=0.5)
print("\n=== Validation Data ===")
X_val_skew, val_skew_info = handle_skewness(X_val_out, threshold=0.5)
print("\n=== Test Data ===")
X_test_skew, test_skew_info = handle_skewness(X_test_out, threshold=0.5)
=== Training Data ===

Skewness analysis (threshold: 0.5):
Found 396 potentially skewed features

Skewness transformation results:
industry_code: 0.50 → 0.23 (Improved)
occupation_code: 0.81 → 0.36 (Improved)
wage_per_hour: 9.43 → 3.93 (Improved)
capital_gains: 26.80 → 5.99 (Improved)
capital_losses: 7.51 → 6.85 (Improved)
dividends_from_stocks: 29.97 → 3.36 (Improved)
instance_weight: 1.40 → -0.79 (Improved)
own_business_or_self_employed: 2.89 → 2.85 (Improved)
veterans_benefits: -1.26 → -1.27 (No change)
work_experience: 0.77 → -0.33 (Improved)
capital_ratio: 26.80 → 5.99 (Improved)
full_year_worker: 0.54 → 0.54 (No change)
has_capital_gains: 9.71 → 9.71 (No change)
has_capital_losses: 6.83 → 6.83 (Improved)
has_dividends: 2.68 → 2.68 (No change)
is_married: -2.14 → -2.14 (Improved)
enrolled_in_edu_inst_encoded: 4.20 → 3.98 (Improved)
member_of_labor_union_encoded: 3.43 → 3.13 (Improved)
fill_inc_questionnaire_for_veteran_encoded: 11.65 → 10.83 (Improved)
class_of_worker_ Federal government: 8.02 → 8.02 (Improved)
class_of_worker_ Local government: 4.74 → 4.74 (No change)
class_of_worker_ Never worked: 20.93 → 20.93 (Improved)
class_of_worker_ Private: 0.56 → 0.56 (No change)
class_of_worker_ Self-employed-incorporated: 7.77 → 7.77 (No change)
class_of_worker_ Self-employed-not incorporated: 4.52 → 4.52 (Improved)
class_of_worker_ State government: 6.60 → 6.60 (No change)
class_of_worker_ Without pay: 34.00 → 34.00 (No change)
education_ 10th grade: 4.78 → 4.78 (No change)
education_ 11th grade: 5.05 → 5.05 (No change)
education_ 12th grade no diploma: 9.46 → 9.46 (No change)
education_ 1st 2nd 3rd or 4th grade: 10.39 → 10.39 (Improved)
education_ 5th or 6th grade: 7.48 → 7.48 (Improved)
education_ 7th and 8th grade: 4.63 → 4.63 (No change)
education_ 9th grade: 5.36 → 5.36 (No change)
education_ Associates degree-academic program: 6.48 → 6.48 (Improved)
education_ Associates degree-occup /vocational: 5.83 → 5.83 (No change)
education_ Bachelors degree(BA AB BS): 2.68 → 2.68 (No change)
education_ Children: 1.29 → 1.29 (No change)
education_ Doctorate degree(PhD EdD): 12.59 → 12.59 (No change)
education_ High school graduate: 1.17 → 1.17 (No change)
education_ Less than 1st grade: 15.24 → 15.24 (Improved)
education_ Masters degree(MA MS MEng MEd MSW MBA): 5.30 → 5.30 (Improved)
education_ Prof school degree (MD DDS DVM LLB JD): 10.77 → 10.77 (Improved)
education_ Some college but no degree: 2.06 → 2.06 (No change)
enrolled_in_edu_inst_ College or university: 5.58 → 5.58 (No change)
enrolled_in_edu_inst_ High school: 5.06 → 5.06 (No change)
enrolled_in_edu_inst_ Not in universe: -3.55 → -3.55 (Improved)
marital_status_ Divorced: 3.54 → 3.54 (Improved)
marital_status_ Married-A F spouse present: 16.84 → 16.84 (Improved)
marital_status_ Married-spouse absent: 11.29 → 11.29 (No change)
marital_status_ Separated: 7.35 → 7.35 (Improved)
marital_status_ Widowed: 4.00 → 4.00 (No change)
major_industry_code_ Agriculture: 7.90 → 7.90 (Improved)
major_industry_code_ Armed Forces: 73.16 → 73.16 (Improved)
major_industry_code_ Business and repair services: 5.66 → 5.66 (Improved)
major_industry_code_ Communications: 12.93 → 12.93 (Improved)
major_industry_code_ Construction: 5.47 → 5.47 (No change)
major_industry_code_ Education: 4.56 → 4.56 (No change)
major_industry_code_ Entertainment: 10.77 → 10.77 (No change)
major_industry_code_ Finance insurance and real estate: 5.45 → 5.45 (Improved)
major_industry_code_ Forestry and fisheries: 32.68 → 32.68 (No change)
major_industry_code_ Hospital services: 6.86 → 6.86 (No change)
major_industry_code_ Manufacturing-durable goods: 4.38 → 4.38 (Improved)
major_industry_code_ Manufacturing-nondurable goods: 5.10 → 5.10 (Improved)
major_industry_code_ Medical except hospital: 6.29 → 6.29 (Improved)
major_industry_code_ Mining: 18.95 → 18.95 (Improved)
major_industry_code_ Other professional services: 6.44 → 6.44 (No change)
major_industry_code_ Personal services except private HH: 7.99 → 7.99 (Improved)
major_industry_code_ Private household services: 14.30 → 14.30 (Improved)
major_industry_code_ Public administration: 6.29 → 6.29 (Improved)
major_industry_code_ Retail trade: 2.92 → 2.92 (No change)
major_industry_code_ Social services: 8.58 → 8.58 (Improved)
major_industry_code_ Transportation: 6.61 → 6.61 (Improved)
major_industry_code_ Utilities and sanitary services: 12.91 → 12.91 (No change)
major_industry_code_ Wholesale trade: 7.26 → 7.26 (Improved)
major_occupation_code_ Adm support including clerical: 3.21 → 3.21 (Improved)
major_occupation_code_ Armed Forces: 73.16 → 73.16 (Improved)
major_occupation_code_ Executive admin and managerial: 3.63 → 3.63 (No change)
major_occupation_code_ Farming forestry and fishing: 7.75 → 7.75 (Improved)
major_occupation_code_ Handlers equip cleaners etc : 6.61 → 6.61 (Improved)
major_occupation_code_ Machine operators assmblrs & inspctrs: 5.29 → 5.29 (No change)
major_occupation_code_ Other service: 3.63 → 3.63 (Improved)
major_occupation_code_ Precision production craft & repair: 4.00 → 4.00 (No change)
major_occupation_code_ Private household services: 15.76 → 15.76 (No change)
major_occupation_code_ Professional specialty: 3.38 → 3.38 (Improved)
major_occupation_code_ Protective services: 10.60 → 10.60 (No change)
major_occupation_code_ Sales: 3.72 → 3.72 (Improved)
major_occupation_code_ Technicians and related support: 7.96 → 7.96 (Improved)
major_occupation_code_ Transportation and material moving: 6.78 → 6.78 (No change)
race_ Amer Indian Aleut or Eskimo: 9.23 → 9.23 (Improved)
race_ Asian or Pacific Islander: 5.54 → 5.54 (No change)
race_ Black: 2.59 → 2.59 (Improved)
race_ Other: 7.07 → 7.07 (Improved)
race_ White: -1.81 → -1.81 (Improved)
hispanic_origin_ All other: -2.06 → -2.06 (Improved)
hispanic_origin_ Central or South American: 6.89 → 6.89 (Improved)
hispanic_origin_ Chicano: 24.72 → 24.72 (No change)
hispanic_origin_ Cuban: 13.17 → 13.17 (Improved)
hispanic_origin_ Do not know: 25.07 → 25.07 (Improved)
hispanic_origin_ Mexican (Mexicano): 4.89 → 4.89 (No change)
hispanic_origin_ Mexican-American: 4.60 → 4.60 (No change)
hispanic_origin_ NA: 15.14 → 15.14 (Improved)
hispanic_origin_ Other Spanish: 8.67 → 8.67 (Improved)
hispanic_origin_ Puerto Rican: 7.49 → 7.49 (Improved)
member_of_labor_union_ No: 3.10 → 3.10 (No change)
member_of_labor_union_ Not in universe: -2.77 → -2.77 (Improved)
member_of_labor_union_ Yes: 8.08 → 8.08 (Improved)
reason_for_unemployment_ Job leaver: 17.92 → 17.92 (Improved)
reason_for_unemployment_ Job loser - on layoff: 14.00 → 14.00 (Improved)
reason_for_unemployment_ New entrant: 20.93 → 20.93 (Improved)
reason_for_unemployment_ Not in universe: -5.40 → -5.40 (Improved)
reason_for_unemployment_ Other job loser: 9.59 → 9.59 (Improved)
reason_for_unemployment_ Re-entrant: 9.72 → 9.72 (No change)
full_or_part_time_employment_ Full-time schedules: 1.46 → 1.46 (No change)
full_or_part_time_employment_ Not in labor force: 2.11 → 2.11 (No change)
full_or_part_time_employment_ PT for econ reasons usually FT: 19.22 → 19.22 (No change)
full_or_part_time_employment_ PT for econ reasons usually PT: 12.76 → 12.76 (Improved)
full_or_part_time_employment_ PT for non-econ reasons usually FT: 7.51 → 7.51 (Improved)
full_or_part_time_employment_ Unemployed full-time: 9.02 → 9.02 (No change)
full_or_part_time_employment_ Unemployed part- time: 15.24 → 15.24 (Improved)
tax_filer_status_ Head of household: 4.83 → 4.83 (Improved)
tax_filer_status_ Joint both 65+: 4.59 → 4.59 (No change)
tax_filer_status_ Joint both under 65: 0.67 → 0.67 (No change)
tax_filer_status_ Joint one under 65 & one 65+: 6.94 → 6.94 (Improved)
tax_filer_status_ Nonfiler: 0.53 → 0.53 (No change)
tax_filer_status_ Single: 1.59 → 1.59 (Improved)
region_of_previous_residence_ Abroad: 19.02 → 19.02 (No change)
region_of_previous_residence_ Midwest: 7.17 → 7.17 (Improved)
region_of_previous_residence_ Northeast: 8.37 → 8.37 (Improved)
region_of_previous_residence_ Not in universe: -3.08 → -3.08 (Improved)
region_of_previous_residence_ South: 6.09 → 6.09 (Improved)
region_of_previous_residence_ West: 6.74 → 6.74 (No change)
state_of_previous_residence_ ?: 16.52 → 16.52 (No change)
state_of_previous_residence_ Abroad: 16.85 → 16.85 (No change)
state_of_previous_residence_ Alabama: 29.57 → 29.57 (No change)
state_of_previous_residence_ Alaska: 26.33 → 26.33 (No change)
state_of_previous_residence_ Arizona: 28.61 → 28.61 (No change)
state_of_previous_residence_ Arkansas: 30.63 → 30.63 (Improved)
state_of_previous_residence_ California: 10.60 → 10.60 (Improved)
state_of_previous_residence_ Colorado: 28.54 → 28.54 (Improved)
state_of_previous_residence_ Connecticut: 40.40 → 40.40 (Improved)
state_of_previous_residence_ Delaware: 53.11 → 53.11 (Improved)
state_of_previous_residence_ District of Columbia: 41.97 → 41.97 (Improved)
state_of_previous_residence_ Florida: 15.14 → 15.14 (Improved)
state_of_previous_residence_ Georgia: 28.92 → 28.92 (No change)
state_of_previous_residence_ Idaho: 82.16 → 82.16 (Improved)
state_of_previous_residence_ Illinois: 32.57 → 32.57 (No change)
state_of_previous_residence_ Indiana: 18.99 → 18.99 (No change)
state_of_previous_residence_ Iowa: 32.02 → 32.02 (Improved)
state_of_previous_residence_ Kansas: 37.20 → 37.20 (No change)
state_of_previous_residence_ Kentucky: 28.17 → 28.17 (Improved)
state_of_previous_residence_ Louisiana: 32.24 → 32.24 (Improved)
state_of_previous_residence_ Maine: 35.49 → 35.49 (No change)
state_of_previous_residence_ Maryland: 38.06 → 38.06 (No change)
state_of_previous_residence_ Massachusetts: 34.93 → 34.93 (Improved)
state_of_previous_residence_ Michigan: 20.53 → 20.53 (Improved)
state_of_previous_residence_ Minnesota: 18.33 → 18.33 (Improved)
state_of_previous_residence_ Mississippi: 30.63 → 30.63 (Improved)
state_of_previous_residence_ Missouri: 32.68 → 32.68 (No change)
state_of_previous_residence_ Montana: 28.54 → 28.54 (Improved)
state_of_previous_residence_ Nebraska: 32.68 → 32.68 (Improved)
state_of_previous_residence_ Nevada: 33.03 → 33.03 (No change)
state_of_previous_residence_ New Hampshire: 28.24 → 28.24 (Improved)
state_of_previous_residence_ New Jersey: 52.63 → 52.63 (No change)
state_of_previous_residence_ New Mexico: 21.02 → 21.02 (Improved)
state_of_previous_residence_ New York: 31.51 → 31.51 (No change)
state_of_previous_residence_ North Carolina: 15.52 → 15.52 (Improved)
state_of_previous_residence_ North Dakota: 19.68 → 19.68 (Improved)
state_of_previous_residence_ Not in universe: -3.08 → -3.08 (Improved)
state_of_previous_residence_ Ohio: 30.91 → 30.91 (No change)
state_of_previous_residence_ Oklahoma: 17.36 → 17.36 (No change)
state_of_previous_residence_ Oregon: 28.69 → 28.69 (No change)
state_of_previous_residence_ Pennsylvania: 31.71 → 31.71 (No change)
state_of_previous_residence_ South Carolina: 46.09 → 46.09 (Improved)
state_of_previous_residence_ South Dakota: 37.71 → 37.71 (Improved)
state_of_previous_residence_ Tennessee: 31.30 → 31.30 (No change)
state_of_previous_residence_ Texas: 31.11 → 31.11 (No change)
state_of_previous_residence_ Utah: 13.45 → 13.45 (Improved)
state_of_previous_residence_ Vermont: 32.57 → 32.57 (No change)
state_of_previous_residence_ Virginia: 39.98 → 39.98 (No change)
state_of_previous_residence_ West Virginia: 29.66 → 29.66 (Improved)
state_of_previous_residence_ Wisconsin: 43.75 → 43.75 (Improved)
state_of_previous_residence_ Wyoming: 28.69 → 28.69 (No change)
detailed_household_summary_ Child 18+ ever marr Not in a subfamily: 13.76 → 13.76 (Improved)
detailed_household_summary_ Child 18+ ever marr RP of subfamily: 17.21 → 17.21 (Improved)
detailed_household_summary_ Child 18+ never marr Not in a subfamily: 3.64 → 3.64 (Improved)
detailed_household_summary_ Child 18+ never marr RP of subfamily: 18.17 → 18.17 (Improved)
detailed_household_summary_ Child 18+ spouse of subfamily RP: 38.24 → 38.24 (No change)
detailed_household_summary_ Child <18 ever marr RP of subfamily: 160.88 → 160.88 (Improved)
detailed_household_summary_ Child <18 ever marr not in subfamily: 84.00 → 84.00 (Improved)
detailed_household_summary_ Child <18 never marr RP of subfamily: 48.85 → 48.85 (No change)
detailed_household_summary_ Child <18 never marr not in subfamily: 1.20 → 1.20 (No change)
detailed_household_summary_ Child <18 spouse of subfamily RP: 278.66 → 278.66 (Improved)
detailed_household_summary_ Child under 18 of RP of unrel subfamily: 16.15 → 16.15 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr RP of subfamily: 139.32 → 139.32 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr not in subfamily: 75.82 → 75.82 (Improved)
detailed_household_summary_ Grandchild 18+ never marr RP of subfamily: 160.88 → 160.88 (Improved)
detailed_household_summary_ Grandchild 18+ never marr not in subfamily: 22.61 → 22.61 (Improved)
detailed_household_summary_ Grandchild 18+ spouse of subfamily RP: 131.35 → 131.35 (No change)
detailed_household_summary_ Grandchild <18 never marr RP of subfamily: 394.09 → 394.09 (No change)
detailed_household_summary_ Grandchild <18 never marr child of subfamily RP: 9.99 → 9.99 (No change)
detailed_household_summary_ Grandchild <18 never marr not in subfamily: 13.58 → 13.58 (Improved)
detailed_household_summary_ Householder: 1.06 → 1.06 (No change)
detailed_household_summary_ In group quarters: 31.81 → 31.81 (No change)
detailed_household_summary_ Nonfamily householder: 2.46 → 2.46 (Improved)
detailed_household_summary_ Other Rel 18+ ever marr RP of subfamily: 17.03 → 17.03 (No change)
detailed_household_summary_ Other Rel 18+ ever marr not in subfamily: 9.84 → 9.84 (No change)
detailed_household_summary_ Other Rel 18+ never marr RP of subfamily: 46.74 → 46.74 (Improved)
detailed_household_summary_ Other Rel 18+ never marr not in subfamily: 10.45 → 10.45 (Improved)
detailed_household_summary_ Other Rel 18+ spouse of subfamily RP: 17.42 → 17.42 (No change)
detailed_household_summary_ Other Rel <18 ever marr RP of subfamily: 160.88 → 160.88 (Improved)
detailed_household_summary_ Other Rel <18 ever marr not in subfamily: 394.09 → 394.09 (No change)
detailed_household_summary_ Other Rel <18 never marr child of subfamily RP: 17.15 → 17.15 (Improved)
detailed_household_summary_ Other Rel <18 never marr not in subfamily: 17.85 → 17.85 (Improved)
detailed_household_summary_ Other Rel <18 never married RP of subfamily: 197.04 → 197.04 (No change)
detailed_household_summary_ Other Rel <18 spouse of subfamily RP: 278.66 → 278.66 (Improved)
detailed_household_summary_ RP of unrelated subfamily: 16.40 → 16.40 (Improved)
detailed_household_summary_ Secondary individual: 5.38 → 5.38 (Improved)
detailed_household_summary_ Spouse of RP of unrelated subfamily: 58.72 → 58.72 (Improved)
detailed_household_summary_ Spouse of householder: 1.39 → 1.39 (No change)
detailed_household_summary_in_household_ Child 18 or older: 3.26 → 3.26 (No change)
detailed_household_summary_in_household_ Child under 18 ever married: 71.93 → 71.93 (Improved)
detailed_household_summary_in_household_ Child under 18 never married: 1.19 → 1.19 (No change)
detailed_household_summary_in_household_ Group Quarters- Secondary individual: 38.61 → 38.61 (No change)
detailed_household_summary_in_household_ Householder: 0.50 → 0.50 (Improved)
detailed_household_summary_in_household_ Nonrelative of householder: 4.75 → 4.75 (No change)
detailed_household_summary_in_household_ Other relative of householder: 4.13 → 4.13 (Improved)
detailed_household_summary_in_household_ Spouse of householder: 1.39 → 1.39 (No change)
migration_code_change_in_msa_ Abroad to MSA: 20.67 → 20.67 (No change)
migration_code_change_in_msa_ Abroad to nonMSA: 49.62 → 49.62 (Improved)
migration_code_change_in_msa_ MSA to MSA: 3.94 → 3.94 (No change)
migration_code_change_in_msa_ MSA to nonMSA: 15.56 → 15.56 (Improved)
migration_code_change_in_msa_ NonMSA to MSA: 18.08 → 18.08 (Improved)
migration_code_change_in_msa_ NonMSA to nonMSA: 8.16 → 8.16 (Improved)
migration_code_change_in_msa_ Not identifiable: 21.33 → 21.33 (No change)
migration_code_change_in_msa_ Not in universe: 11.43 → 11.43 (Improved)
migration_code_change_in_reg_ Abroad: 19.02 → 19.02 (No change)
migration_code_change_in_reg_ Different county same state: 8.12 → 8.12 (No change)
migration_code_change_in_reg_ Different division same region: 21.12 → 21.12 (No change)
migration_code_change_in_reg_ Different region: 12.84 → 12.84 (No change)
migration_code_change_in_reg_ Different state same division: 14.00 → 14.00 (Improved)
migration_code_change_in_reg_ Not in universe: 11.43 → 11.43 (Improved)
migration_code_change_in_reg_ Same county: 4.13 → 4.13 (No change)
migration_code_move_within_reg_ Abroad: 19.02 → 19.02 (No change)
migration_code_move_within_reg_ Different county same state: 8.12 → 8.12 (No change)
migration_code_move_within_reg_ Different state in Midwest: 18.79 → 18.79 (Improved)
migration_code_move_within_reg_ Different state in Northeast: 21.33 → 21.33 (No change)
migration_code_move_within_reg_ Different state in South: 14.20 → 14.20 (Improved)
migration_code_move_within_reg_ Different state in West: 17.25 → 17.25 (No change)
migration_code_move_within_reg_ Not in universe: 11.43 → 11.43 (Improved)
migration_code_move_within_reg_ Same county: 4.13 → 4.13 (No change)
live_in_this_house_1_year_ago_ No: 3.08 → 3.08 (Improved)
migration_prev_res_in_sunbelt_ No: 4.09 → 4.09 (Improved)
migration_prev_res_in_sunbelt_ Yes: 5.55 → 5.55 (No change)
family_members_under_18_ Both parents present: 1.62 → 1.62 (Improved)
family_members_under_18_ Father only present: 10.04 → 10.04 (Improved)
family_members_under_18_ Mother only present: 3.53 → 3.53 (Improved)
family_members_under_18_ Neither parent present: 10.73 → 10.73 (Improved)
family_members_under_18_ Not in universe: -1.04 → -1.04 (Improved)
country_of_birth_father_ ?: 5.15 → 5.15 (No change)
country_of_birth_father_ Cambodia: 31.81 → 31.81 (No change)
country_of_birth_father_ Canada: 11.88 → 11.88 (Improved)
country_of_birth_father_ China: 14.90 → 14.90 (Improved)
country_of_birth_father_ Columbia: 17.83 → 17.83 (No change)
country_of_birth_father_ Cuba: 13.10 → 13.10 (Improved)
country_of_birth_father_ Dominican-Republic: 12.12 → 12.12 (Improved)
country_of_birth_father_ Ecuador: 22.84 → 22.84 (Improved)
country_of_birth_father_ El-Salvador: 13.87 → 13.87 (No change)
country_of_birth_father_ England: 16.15 → 16.15 (Improved)
country_of_birth_father_ France: 32.13 → 32.13 (Improved)
country_of_birth_father_ Germany: 11.83 → 11.83 (Improved)
country_of_birth_father_ Greece: 24.06 → 24.06 (Improved)
country_of_birth_father_ Guatemala: 21.21 → 21.21 (Improved)
country_of_birth_father_ Haiti: 23.70 → 23.70 (Improved)
country_of_birth_father_ Holand-Netherlands: 63.08 → 63.08 (Improved)
country_of_birth_father_ Honduras: 31.51 → 31.51 (No change)
country_of_birth_father_ Hong Kong: 42.71 → 42.71 (Improved)
country_of_birth_father_ Hungary: 25.98 → 25.98 (No change)
country_of_birth_father_ India: 18.48 → 18.48 (Improved)
country_of_birth_father_ Iran: 29.00 → 29.00 (No change)
country_of_birth_father_ Ireland: 19.27 → 19.27 (Improved)
country_of_birth_father_ Italy: 9.25 → 9.25 (Improved)
country_of_birth_father_ Jamaica: 20.44 → 20.44 (No change)
country_of_birth_father_ Japan: 22.39 → 22.39 (No change)
country_of_birth_father_ Laos: 36.39 → 36.39 (No change)
country_of_birth_father_ Mexico: 4.07 → 4.07 (Improved)
country_of_birth_father_ Nicaragua: 25.65 → 25.65 (No change)
country_of_birth_father_ Outlying-U S (Guam USVI etc): 33.26 → 33.26 (No change)
country_of_birth_father_ Panama: 90.39 → 90.39 (Improved)
country_of_birth_father_ Peru: 23.57 → 23.57 (Improved)
country_of_birth_father_ Philippines: 12.99 → 12.99 (Improved)
country_of_birth_father_ Poland: 12.72 → 12.72 (Improved)
country_of_birth_father_ Portugal: 22.43 → 22.43 (No change)
country_of_birth_father_ Puerto-Rico: 8.38 → 8.38 (No change)
country_of_birth_father_ Scotland: 28.39 →
28.39 (Improved) country_of_birth_father_ South Korea: 19.11 ā 19.11 (Improved) country_of_birth_father_ Taiwan: 31.30 ā 31.30 (No change) country_of_birth_father_ Thailand: 42.22 ā 42.22 (Improved) country_of_birth_father_ Trinadad&Tobago: 40.83 ā 40.83 (No change) country_of_birth_father_ United-States: -1.46 ā -1.46 (No change) country_of_birth_father_ Vietnam: 20.61 ā 20.61 (Improved) country_of_birth_father_ Yugoslavia: 28.92 ā 28.92 (No change) country_of_birth_mother_ ?: 5.42 ā 5.42 (Improved) country_of_birth_mother_ Cambodia: 35.78 ā 35.78 (No change) country_of_birth_mother_ Canada: 11.61 ā 11.61 (Improved) country_of_birth_mother_ China: 15.78 ā 15.78 (No change) country_of_birth_mother_ Columbia: 17.81 ā 17.81 (No change) country_of_birth_mother_ Cuba: 13.19 ā 13.19 (Improved) country_of_birth_mother_ Dominican-Republic: 13.07 ā 13.07 (Improved) country_of_birth_mother_ Ecuador: 23.24 ā 23.24 (Improved) country_of_birth_mother_ El-Salvador: 13.13 ā 13.13 (Improved) country_of_birth_mother_ England: 15.04 ā 15.04 (Improved) country_of_birth_mother_ France: 30.63 ā 30.63 (Improved) country_of_birth_mother_ Germany: 11.72 ā 11.72 (No change) country_of_birth_mother_ Greece: 27.81 ā 27.81 (No change) country_of_birth_mother_ Guatemala: 21.15 ā 21.15 (No change) country_of_birth_mother_ Haiti: 23.74 ā 23.74 (No change) country_of_birth_mother_ Holand-Netherlands: 63.08 ā 63.08 (Improved) country_of_birth_mother_ Honduras: 29.16 ā 29.16 (Improved) country_of_birth_mother_ Hong Kong: 42.22 ā 42.22 (No change) country_of_birth_mother_ Hungary: 26.33 ā 26.33 (No change) country_of_birth_mother_ India: 18.39 ā 18.39 (Improved) country_of_birth_mother_ Iran: 31.61 ā 31.61 (Improved) country_of_birth_mother_ Ireland: 17.66 ā 17.66 (Improved) country_of_birth_mother_ Italy: 10.18 ā 10.18 (No change) country_of_birth_mother_ Jamaica: 20.67 ā 20.67 (No change) country_of_birth_mother_ Japan: 20.41 ā 20.41 (Improved) country_of_birth_mother_ Laos: 35.93 ā 35.93 (No 
change) country_of_birth_mother_ Mexico: 4.12 ā 4.12 (Improved) country_of_birth_mother_ Nicaragua: 26.22 ā 26.22 (No change) country_of_birth_mother_ Outlying-U S (Guam USVI etc): 34.00 ā 34.00 (No change) country_of_birth_mother_ Panama: 78.80 ā 78.80 (Improved) country_of_birth_mother_ Peru: 23.00 ā 23.00 (No change) country_of_birth_mother_ Philippines: 12.59 ā 12.59 (Improved) country_of_birth_mother_ Poland: 13.28 ā 13.28 (No change) country_of_birth_mother_ Portugal: 23.83 ā 23.83 (Improved) country_of_birth_mother_ Puerto-Rico: 8.74 ā 8.74 (Improved) country_of_birth_mother_ Scotland: 27.88 ā 27.88 (Improved) country_of_birth_mother_ South Korea: 17.76 ā 17.76 (Improved) country_of_birth_mother_ Taiwan: 29.40 ā 29.40 (Improved) country_of_birth_mother_ Thailand: 39.18 ā 39.18 (Improved) country_of_birth_mother_ Trinadad&Tobago: 43.49 ā 43.49 (Improved) country_of_birth_mother_ United-States: -1.51 ā -1.51 (Improved) country_of_birth_mother_ Vietnam: 20.30 ā 20.30 (No change) country_of_birth_mother_ Yugoslavia: 32.35 ā 32.35 (No change) country_of_birth_self_ ?: 7.41 ā 7.41 (Improved) country_of_birth_self_ Cambodia: 44.88 ā 44.88 (Improved) country_of_birth_self_ Canada: 16.87 ā 16.87 (Improved) country_of_birth_self_ China: 19.85 ā 19.85 (No change) country_of_birth_self_ Columbia: 21.30 ā 21.30 (No change) country_of_birth_self_ Cuba: 15.21 ā 15.21 (Improved) country_of_birth_self_ Dominican-Republic: 16.59 ā 16.59 (Improved) country_of_birth_self_ Ecuador: 27.95 ā 27.95 (Improved) country_of_birth_self_ El-Salvador: 16.85 ā 16.85 (No change) country_of_birth_self_ England: 21.43 ā 21.43 (Improved) country_of_birth_self_ France: 40.40 ā 40.40 (No change) country_of_birth_self_ Germany: 15.11 ā 15.11 (No change) country_of_birth_self_ Greece: 37.37 ā 37.37 (No change) country_of_birth_self_ Guatemala: 23.92 ā 23.92 (Improved) country_of_birth_self_ Haiti: 29.83 ā 29.83 (No change) country_of_birth_self_ Holand-Netherlands: 101.74 ā 101.74 (Improved) 
country_of_birth_self_ Honduras: 36.08 ā 36.08 (No change) country_of_birth_self_ Hong Kong: 43.22 ā 43.22 (Improved) country_of_birth_self_ Hungary: 50.85 ā 50.85 (Improved) country_of_birth_self_ India: 22.24 ā 22.24 (Improved) country_of_birth_self_ Iran: 35.49 ā 35.49 (No change) country_of_birth_self_ Ireland: 38.06 ā 38.06 (No change) country_of_birth_self_ Italy: 21.43 ā 21.43 (Improved) country_of_birth_self_ Jamaica: 24.52 ā 24.52 (No change) country_of_birth_self_ Japan: 24.38 ā 24.38 (Improved) country_of_birth_self_ Laos: 42.96 ā 42.96 (No change) country_of_birth_self_ Mexico: 5.54 ā 5.54 (Improved) country_of_birth_self_ Nicaragua: 31.21 ā 31.21 (No change) country_of_birth_self_ Outlying-U S (Guam USVI etc): 39.57 ā 39.57 (No change) country_of_birth_self_ Panama: 84.00 ā 84.00 (No change) country_of_birth_self_ Peru: 26.22 ā 26.22 (No change) country_of_birth_self_ Philippines: 15.34 ā 15.34 (Improved) country_of_birth_self_ Poland: 23.08 ā 23.08 (Improved) country_of_birth_self_ Portugal: 32.68 ā 32.68 (No change) country_of_birth_self_ Puerto-Rico: 11.69 ā 11.69 (No change) country_of_birth_self_ Scotland: 50.02 ā 50.02 (Improved) country_of_birth_self_ South Korea: 20.14 ā 20.14 (Improved) country_of_birth_self_ Taiwan: 30.82 ā 30.82 (Improved) country_of_birth_self_ Thailand: 41.74 ā 41.74 (Improved) country_of_birth_self_ Trinadad&Tobago: 52.63 ā 52.63 (No change) country_of_birth_self_ United-States: -2.41 ā -2.41 (Improved) country_of_birth_self_ Vietnam: 22.03 ā 22.03 (Improved) country_of_birth_self_ Yugoslavia: 52.63 ā 52.63 (Improved) citizenship_ Foreign born- Not a citizen of U S : 3.41 ā 3.41 (No change) citizenship_ Foreign born- U S citizen by naturalization: 5.54 ā 5.54 (No change) citizenship_ Native- Born abroad of American Parent(s): 10.45 ā 10.45 (Improved) citizenship_ Native- Born in Puerto Rico or U S Outlying: 11.19 ā 11.19 (Improved) citizenship_ Native- Born in the United States: -2.41 ā -2.41 (Improved) 
fill_inc_questionnaire_for_veteran_ No: 10.98 → 10.98 (No change) fill_inc_questionnaire_for_veteran_ Not in universe: -9.77 → -9.77 (Improved) fill_inc_questionnaire_for_veteran_ Yes: 21.93 → 21.93 (Improved)
=== Validation Data ===
Skewness analysis (threshold: 0.5): Found 389 potentially skewed features
Skewness transformation results:
occupation_code: 0.80 → 0.34 (Improved) wage_per_hour: 8.49 → 3.91 (Improved) capital_gains: 24.14 → 6.00 (Improved) capital_losses: 7.71 → 6.99 (Improved) dividends_from_stocks: 29.19 → 3.33 (Improved) instance_weight: 1.58 → -0.82 (Improved) own_business_or_self_employed: 2.89 → 2.85 (Improved) veterans_benefits: -1.27 → -1.28 (No change) work_experience: 0.77 → -0.34 (Improved) capital_ratio: 24.14 → 6.00 (Improved) full_year_worker: 0.53 → 0.53 (No change) has_capital_gains: 9.34 → 9.34 (Improved) has_capital_losses: 6.98 → 6.98 (Improved) has_dividends: 2.65 → 2.65 (No change) is_married: -2.14 → -2.14 (Improved) enrolled_in_edu_inst_encoded: 4.19 → 3.96 (Improved) member_of_labor_union_encoded: 3.41 → 3.11 (Improved) fill_inc_questionnaire_for_veteran_encoded: 12.01 → 11.16 (Improved) class_of_worker_ Federal government: 8.30 → 8.30 (No change) class_of_worker_ Local government: 4.69 → 4.69 (No change) class_of_worker_ Never worked: 21.18 → 21.18 (Improved) class_of_worker_ Private: 0.56 → 0.56 (No change) class_of_worker_ Self-employed-incorporated: 7.52 → 7.52 (Improved) class_of_worker_ Self-employed-not incorporated: 4.49 → 4.49 (No change) class_of_worker_ State government: 6.59 → 6.59 (Improved) class_of_worker_ Without pay: 35.35 → 35.35 (Improved) education_ 10th grade: 4.83 → 4.83 (No change) education_ 11th grade: 5.02 → 5.02 (No change) education_ 12th grade no diploma: 9.26 → 9.26 (Improved) education_ 1st 2nd 3rd or 4th grade: 9.73 → 9.73 (No change) education_ 5th or 6th grade: 7.63 → 7.63 (Improved) education_ 7th and 8th grade: 4.65 → 4.65 (No change) education_ 9th grade: 5.24 → 5.24 (Improved) education_ 
Associates degree-academic program: 6.52 ā 6.52 (No change) education_ Associates degree-occup /vocational: 5.74 ā 5.74 (No change) education_ Bachelors degree(BA AB BS): 2.69 ā 2.69 (No change) education_ Children: 1.30 ā 1.30 (Improved) education_ Doctorate degree(PhD EdD): 13.05 ā 13.05 (Improved) education_ High school graduate: 1.18 ā 1.18 (No change) education_ Less than 1st grade: 15.63 ā 15.63 (Improved) education_ Masters degree(MA MS MEng MEd MSW MBA): 5.26 ā 5.26 (Improved) education_ Prof school degree (MD DDS DVM LLB JD): 10.69 ā 10.69 (No change) education_ Some college but no degree: 2.05 ā 2.05 (No change) enrolled_in_edu_inst_ College or university: 5.64 ā 5.64 (No change) enrolled_in_edu_inst_ High school: 4.96 ā 4.96 (No change) enrolled_in_edu_inst_ Not in universe: -3.53 ā -3.53 (Improved) marital_status_ Divorced: 3.59 ā 3.59 (No change) marital_status_ Married-A F spouse present: 17.83 ā 17.83 (No change) marital_status_ Married-spouse absent: 11.13 ā 11.13 (Improved) marital_status_ Separated: 7.21 ā 7.21 (Improved) marital_status_ Widowed: 3.99 ā 3.99 (No change) major_industry_code_ Agriculture: 7.69 ā 7.69 (Improved) major_industry_code_ Armed Forces: 80.43 ā 80.43 (Improved) major_industry_code_ Business and repair services: 5.63 ā 5.63 (Improved) major_industry_code_ Communications: 13.33 ā 13.33 (Improved) major_industry_code_ Construction: 5.46 ā 5.46 (No change) major_industry_code_ Education: 4.56 ā 4.56 (No change) major_industry_code_ Entertainment: 10.79 ā 10.79 (Improved) major_industry_code_ Finance insurance and real estate: 5.34 ā 5.34 (Improved) major_industry_code_ Forestry and fisheries: 32.80 ā 32.80 (No change) major_industry_code_ Hospital services: 6.75 ā 6.75 (Improved) major_industry_code_ Manufacturing-durable goods: 4.35 ā 4.35 (No change) major_industry_code_ Manufacturing-nondurable goods: 4.97 ā 4.97 (No change) major_industry_code_ Medical except hospital: 6.18 ā 6.18 (No change) major_industry_code_ Mining: 
18.06 ā 18.06 (Improved) major_industry_code_ Other professional services: 6.54 ā 6.54 (No change) major_industry_code_ Personal services except private HH: 7.92 ā 7.92 (Improved) major_industry_code_ Private household services: 14.08 ā 14.08 (Improved) major_industry_code_ Public administration: 6.49 ā 6.49 (No change) major_industry_code_ Retail trade: 2.96 ā 2.96 (Improved) major_industry_code_ Social services: 8.69 ā 8.69 (Improved) major_industry_code_ Transportation: 6.73 ā 6.73 (No change) major_industry_code_ Utilities and sanitary services: 12.82 ā 12.82 (No change) major_industry_code_ Wholesale trade: 7.05 ā 7.05 (Improved) major_occupation_code_ Adm support including clerical: 3.19 ā 3.19 (No change) major_occupation_code_ Armed Forces: 80.43 ā 80.43 (Improved) major_occupation_code_ Executive admin and managerial: 3.63 ā 3.63 (No change) major_occupation_code_ Farming forestry and fishing: 7.50 ā 7.50 (Improved) major_occupation_code_ Handlers equip cleaners etc : 6.90 ā 6.90 (Improved) major_occupation_code_ Machine operators assmblrs & inspctrs: 5.19 ā 5.19 (Improved) major_occupation_code_ Other service: 3.64 ā 3.64 (No change) major_occupation_code_ Precision production craft & repair: 3.88 ā 3.88 (Improved) major_occupation_code_ Private household services: 15.48 ā 15.48 (Improved) major_occupation_code_ Professional specialty: 3.39 ā 3.39 (No change) major_occupation_code_ Protective services: 11.50 ā 11.50 (No change) major_occupation_code_ Sales: 3.72 ā 3.72 (Improved) major_occupation_code_ Technicians and related support: 7.69 ā 7.69 (No change) major_occupation_code_ Transportation and material moving: 6.81 ā 6.81 (Improved) race_ Amer Indian Aleut or Eskimo: 8.90 ā 8.90 (No change) race_ Asian or Pacific Islander: 5.54 ā 5.54 (Improved) race_ Black: 2.65 ā 2.65 (No change) race_ Other: 7.22 ā 7.22 (No change) race_ White: -1.85 ā -1.85 (Improved) hispanic_origin_ All other: -2.06 ā -2.06 (Improved) hispanic_origin_ Central or South 
American: 6.73 ā 6.73 (Improved) hispanic_origin_ Chicano: 28.39 ā 28.39 (Improved) hispanic_origin_ Cuban: 12.68 ā 12.68 (Improved) hispanic_origin_ Do not know: 26.28 ā 26.28 (Improved) hispanic_origin_ Mexican (Mexicano): 4.92 ā 4.92 (No change) hispanic_origin_ Mexican-American: 4.69 ā 4.69 (Improved) hispanic_origin_ NA: 14.23 ā 14.23 (Improved) hispanic_origin_ Other Spanish: 8.83 ā 8.83 (No change) hispanic_origin_ Puerto Rican: 7.39 ā 7.39 (No change) member_of_labor_union_ No: 3.08 ā 3.08 (Improved) member_of_labor_union_ Not in universe: -2.76 ā -2.76 (Improved) member_of_labor_union_ Yes: 8.06 ā 8.06 (Improved) reason_for_unemployment_ Job leaver: 18.29 ā 18.29 (Improved) reason_for_unemployment_ Job loser - on layoff: 14.01 ā 14.01 (Improved) reason_for_unemployment_ New entrant: 21.18 ā 21.18 (Improved) reason_for_unemployment_ Not in universe: -5.39 ā -5.39 (Improved) reason_for_unemployment_ Other job loser: 9.79 ā 9.79 (Improved) reason_for_unemployment_ Re-entrant: 9.41 ā 9.41 (Improved) full_or_part_time_employment_ Full-time schedules: 1.44 ā 1.44 (No change) full_or_part_time_employment_ Not in labor force: 2.13 ā 2.13 (No change) full_or_part_time_employment_ PT for econ reasons usually FT: 19.25 ā 19.25 (Improved) full_or_part_time_employment_ PT for econ reasons usually PT: 12.42 ā 12.42 (Improved) full_or_part_time_employment_ PT for non-econ reasons usually FT: 7.61 ā 7.61 (Improved) full_or_part_time_employment_ Unemployed full-time: 9.03 ā 9.03 (Improved) full_or_part_time_employment_ Unemployed part- time: 14.75 ā 14.75 (Improved) tax_filer_status_ Head of household: 4.95 ā 4.95 (No change) tax_filer_status_ Joint both 65+: 4.58 ā 4.58 (No change) tax_filer_status_ Joint both under 65: 0.65 ā 0.65 (No change) tax_filer_status_ Joint one under 65 & one 65+: 7.20 ā 7.20 (Improved) tax_filer_status_ Nonfiler: 0.55 ā 0.55 (No change) tax_filer_status_ Single: 1.57 ā 1.57 (Improved) region_of_previous_residence_ Abroad: 19.53 ā 19.53 (No 
change) region_of_previous_residence_ Midwest: 7.40 ā 7.40 (No change) region_of_previous_residence_ Northeast: 8.24 ā 8.24 (No change) region_of_previous_residence_ Not in universe: -3.08 ā -3.08 (Improved) region_of_previous_residence_ South: 6.08 ā 6.08 (No change) region_of_previous_residence_ West: 6.58 ā 6.58 (No change) state_of_previous_residence_ ?: 16.75 ā 16.75 (Improved) state_of_previous_residence_ Abroad: 17.47 ā 17.47 (No change) state_of_previous_residence_ Alabama: 31.92 ā 31.92 (No change) state_of_previous_residence_ Alaska: 24.38 ā 24.38 (Improved) state_of_previous_residence_ Arizona: 27.01 ā 27.01 (No change) state_of_previous_residence_ Arkansas: 31.92 ā 31.92 (No change) state_of_previous_residence_ California: 10.34 ā 10.34 (Improved) state_of_previous_residence_ Colorado: 28.39 ā 28.39 (Improved) state_of_previous_residence_ Connecticut: 42.97 ā 42.97 (Improved) state_of_previous_residence_ Delaware: 46.41 ā 46.41 (Improved) state_of_previous_residence_ District of Columbia: 40.19 ā 40.19 (No change) state_of_previous_residence_ Florida: 14.93 ā 14.93 (Improved) state_of_previous_residence_ Georgia: 30.36 ā 30.36 (Improved) state_of_previous_residence_ Idaho: 80.43 ā 80.43 (Improved) state_of_previous_residence_ Illinois: 34.26 ā 34.26 (No change) state_of_previous_residence_ Indiana: 19.63 ā 19.63 (No change) state_of_previous_residence_ Iowa: 31.92 ā 31.92 (No change) state_of_previous_residence_ Kansas: 32.80 ā 32.80 (No change) state_of_previous_residence_ Kentucky: 29.00 ā 29.00 (Improved) state_of_previous_residence_ Louisiana: 30.00 ā 30.00 (No change) state_of_previous_residence_ Maine: 30.00 ā 30.00 (No change) state_of_previous_residence_ Maryland: 38.61 ā 38.61 (Improved) state_of_previous_residence_ Massachusetts: 40.19 ā 40.19 (Improved) state_of_previous_residence_ Michigan: 23.16 ā 23.16 (Improved) state_of_previous_residence_ Minnesota: 18.88 ā 18.88 (Improved) state_of_previous_residence_ Mississippi: 32.80 ā 32.80 (No 
change) state_of_previous_residence_ Missouri: 37.20 ā 37.20 (Improved) state_of_previous_residence_ Montana: 33.75 ā 33.75 (No change) state_of_previous_residence_ Nebraska: 35.94 ā 35.94 (No change) state_of_previous_residence_ Nevada: 34.79 ā 34.79 (No change) state_of_previous_residence_ New Hampshire: 29.32 ā 29.32 (No change) state_of_previous_residence_ New Jersey: 46.41 ā 46.41 (Improved) state_of_previous_residence_ New Mexico: 18.88 ā 18.88 (Improved) state_of_previous_residence_ New York: 31.51 ā 31.51 (Improved) state_of_previous_residence_ North Carolina: 15.20 ā 15.20 (No change) state_of_previous_residence_ North Dakota: 20.04 ā 20.04 (Improved) state_of_previous_residence_ Not in universe: -3.08 ā -3.08 (Improved) state_of_previous_residence_ Ohio: 28.69 ā 28.69 (Improved) state_of_previous_residence_ Oklahoma: 19.06 ā 19.06 (No change) state_of_previous_residence_ Oregon: 29.00 ā 29.00 (Improved) state_of_previous_residence_ Pennsylvania: 29.66 ā 29.66 (No change) state_of_previous_residence_ South Carolina: 41.98 ā 41.98 (No change) state_of_previous_residence_ South Dakota: 36.55 ā 36.55 (No change) state_of_previous_residence_ Tennessee: 30.36 ā 30.36 (No change) state_of_previous_residence_ Texas: 29.32 ā 29.32 (No change) state_of_previous_residence_ Utah: 13.39 ā 13.39 (No change) state_of_previous_residence_ Vermont: 29.66 ā 29.66 (No change) state_of_previous_residence_ Virginia: 37.20 ā 37.20 (Improved) state_of_previous_residence_ West Virginia: 26.51 ā 26.51 (No change) state_of_previous_residence_ Wisconsin: 40.19 ā 40.19 (No change) state_of_previous_residence_ Wyoming: 28.69 ā 28.69 (Improved) detailed_household_summary_ Child 18+ ever marr Not in a subfamily: 13.97 ā 13.97 (No change) detailed_household_summary_ Child 18+ ever marr RP of subfamily: 16.11 ā 16.11 (Improved) detailed_household_summary_ Child 18+ never marr Not in a subfamily: 3.64 ā 3.64 (No change) detailed_household_summary_ Child 18+ never marr RP of subfamily: 
17.76 ā 17.76 (Improved) detailed_household_summary_ Child 18+ spouse of subfamily RP: 44.03 ā 44.03 (No change) detailed_household_summary_ Child <18 ever marr RP of subfamily: 113.76 ā 113.76 (No change) detailed_household_summary_ Child <18 ever marr not in subfamily: 52.64 ā 52.64 (Improved) detailed_household_summary_ Child <18 never marr RP of subfamily: 50.85 ā 50.85 (Improved) detailed_household_summary_ Child <18 never marr not in subfamily: 1.19 ā 1.19 (No change) detailed_household_summary_ Child under 18 of RP of unrel subfamily: 16.56 ā 16.56 (Improved) detailed_household_summary_ Grandchild 18+ ever marr RP of subfamily: 197.05 ā 197.05 (No change) detailed_household_summary_ Grandchild 18+ ever marr not in subfamily: 74.46 ā 74.46 (Improved) detailed_household_summary_ Grandchild 18+ never marr not in subfamily: 23.00 ā 23.00 (No change) detailed_household_summary_ Grandchild 18+ spouse of subfamily RP: 197.05 ā 197.05 (No change) detailed_household_summary_ Grandchild <18 never marr RP of subfamily: 197.05 ā 197.05 (No change) detailed_household_summary_ Grandchild <18 never marr child of subfamily RP: 10.74 ā 10.74 (Improved) detailed_household_summary_ Grandchild <18 never marr not in subfamily: 12.99 ā 12.99 (No change) detailed_household_summary_ Householder: 1.05 ā 1.05 (No change) detailed_household_summary_ In group quarters: 31.11 ā 31.11 (Improved) detailed_household_summary_ Nonfamily householder: 2.48 ā 2.48 (No change) detailed_household_summary_ Other Rel 18+ ever marr RP of subfamily: 18.14 ā 18.14 (Improved) detailed_household_summary_ Other Rel 18+ ever marr not in subfamily: 9.78 ā 9.78 (Improved) detailed_household_summary_ Other Rel 18+ never marr RP of subfamily: 41.05 ā 41.05 (No change) detailed_household_summary_ Other Rel 18+ never marr not in subfamily: 10.61 ā 10.61 (Improved) detailed_household_summary_ Other Rel 18+ spouse of subfamily RP: 17.13 ā 17.13 (No change) detailed_household_summary_ Other Rel <18 never marr 
child of subfamily RP: 17.20 ā 17.20 (No change) detailed_household_summary_ Other Rel <18 never marr not in subfamily: 19.93 ā 19.93 (Improved) detailed_household_summary_ Other Rel <18 spouse of subfamily RP: 197.05 ā 197.05 (No change) detailed_household_summary_ RP of unrelated subfamily: 18.54 ā 18.54 (Improved) detailed_household_summary_ Secondary individual: 5.45 ā 5.45 (No change) detailed_household_summary_ Spouse of RP of unrelated subfamily: 74.46 ā 74.46 (Improved) detailed_household_summary_ Spouse of householder: 1.38 ā 1.38 (No change) detailed_household_summary_in_household_ Child 18 or older: 3.25 ā 3.25 (Improved) detailed_household_summary_in_household_ Child under 18 ever married: 47.76 ā 47.76 (Improved) detailed_household_summary_in_household_ Child under 18 never married: 1.19 ā 1.19 (Improved) detailed_household_summary_in_household_ Group Quarters- Secondary individual: 38.61 ā 38.61 (Improved) detailed_household_summary_in_household_ Nonrelative of householder: 4.88 ā 4.88 (No change) detailed_household_summary_in_household_ Other relative of householder: 4.22 ā 4.22 (No change) detailed_household_summary_in_household_ Spouse of householder: 1.38 ā 1.38 (No change) migration_code_change_in_msa_ Abroad to MSA: 20.82 ā 20.82 (No change) migration_code_change_in_msa_ Abroad to nonMSA: 62.29 ā 62.29 (Improved) migration_code_change_in_msa_ MSA to MSA: 3.94 ā 3.94 (Improved) migration_code_change_in_msa_ MSA to nonMSA: 16.11 ā 16.11 (No change) migration_code_change_in_msa_ NonMSA to MSA: 16.50 ā 16.50 (Improved) migration_code_change_in_msa_ NonMSA to nonMSA: 8.15 ā 8.15 (No change) migration_code_change_in_msa_ Not identifiable: 20.94 ā 20.94 (No change) migration_code_change_in_msa_ Not in universe: 12.17 ā 12.17 (No change) migration_code_change_in_reg_ Abroad: 19.53 ā 19.53 (No change) migration_code_change_in_reg_ Different county same state: 8.45 ā 8.45 (No change) migration_code_change_in_reg_ Different division same region: 18.37 ā 
18.37 (Improved) migration_code_change_in_reg_ Different region: 12.68 ā 12.68 (Improved) migration_code_change_in_reg_ Different state same division: 13.76 ā 13.76 (Improved) migration_code_change_in_reg_ Not in universe: 12.17 ā 12.17 (No change) migration_code_change_in_reg_ Same county: 4.11 ā 4.11 (No change) migration_code_move_within_reg_ Abroad: 19.53 ā 19.53 (No change) migration_code_move_within_reg_ Different county same state: 8.45 ā 8.45 (No change) migration_code_move_within_reg_ Different state in Midwest: 19.15 ā 19.15 (No change) migration_code_move_within_reg_ Different state in Northeast: 21.06 ā 21.06 (No change) migration_code_move_within_reg_ Different state in South: 13.62 ā 13.62 (Improved) migration_code_move_within_reg_ Different state in West: 15.73 ā 15.73 (Improved) migration_code_move_within_reg_ Not in universe: 12.17 ā 12.17 (No change) migration_code_move_within_reg_ Same county: 4.11 ā 4.11 (No change) live_in_this_house_1_year_ago_ No: 3.08 ā 3.08 (No change) migration_prev_res_in_sunbelt_ No: 4.08 ā 4.08 (Improved) migration_prev_res_in_sunbelt_ Yes: 5.54 ā 5.54 (Improved) family_members_under_18_ Both parents present: 1.60 ā 1.60 (Improved) family_members_under_18_ Father only present: 10.01 ā 10.01 (Improved) family_members_under_18_ Mother only present: 3.61 ā 3.61 (Improved) family_members_under_18_ Neither parent present: 10.83 ā 10.83 (Improved) family_members_under_18_ Not in universe: -1.05 ā -1.05 (Improved) country_of_birth_father_ ?: 5.12 ā 5.12 (No change) country_of_birth_father_ Cambodia: 30.36 ā 30.36 (No change) country_of_birth_father_ Canada: 11.71 ā 11.71 (Improved) country_of_birth_father_ China: 15.63 ā 15.63 (Improved) country_of_birth_father_ Columbia: 17.26 ā 17.26 (Improved) country_of_birth_father_ Cuba: 12.99 ā 12.99 (Improved) country_of_birth_father_ Dominican-Republic: 12.44 ā 12.44 (Improved) country_of_birth_father_ Ecuador: 21.83 ā 21.83 (Improved) country_of_birth_father_ El-Salvador: 14.34 ā 
14.34 (Improved) country_of_birth_father_ England: 14.15 ā 14.15 (Improved) country_of_birth_father_ France: 31.11 ā 31.11 (Improved) country_of_birth_father_ Germany: 12.63 ā 12.63 (Improved) country_of_birth_father_ Greece: 22.84 ā 22.84 (Improved) country_of_birth_father_ Guatemala: 19.83 ā 19.83 (Improved) country_of_birth_father_ Haiti: 22.84 ā 22.84 (Improved) country_of_birth_father_ Holand-Netherlands: 59.39 ā 59.39 (No change) country_of_birth_father_ Honduras: 31.92 ā 31.92 (No change) country_of_birth_father_ Hong Kong: 44.03 ā 44.03 (No change) country_of_birth_father_ Hungary: 24.01 ā 24.01 (No change) country_of_birth_father_ India: 17.76 ā 17.76 (Improved) country_of_birth_father_ Iran: 28.10 ā 28.10 (No change) country_of_birth_father_ Ireland: 21.43 ā 21.43 (Improved) country_of_birth_father_ Italy: 9.41 ā 9.41 (Improved) country_of_birth_father_ Jamaica: 20.82 ā 20.82 (No change) country_of_birth_father_ Japan: 22.84 ā 22.84 (Improved) country_of_birth_father_ Laos: 32.35 ā 32.35 (Improved) country_of_birth_father_ Mexico: 4.10 ā 4.10 (No change) country_of_birth_father_ Nicaragua: 22.10 ā 22.10 (Improved) country_of_birth_father_ Outlying-U S (Guam USVI etc): 45.18 ā 45.18 (Improved) country_of_birth_father_ Panama: 80.43 ā 80.43 (Improved) country_of_birth_father_ Peru: 26.28 ā 26.28 (No change) country_of_birth_father_ Philippines: 12.63 ā 12.63 (Improved) country_of_birth_father_ Poland: 12.71 ā 12.71 (Improved) country_of_birth_father_ Portugal: 22.25 ā 22.25 (Improved) country_of_birth_father_ Puerto-Rico: 8.22 ā 8.22 (Improved) country_of_birth_father_ Scotland: 28.69 ā 28.69 (Improved) country_of_birth_father_ South Korea: 19.43 ā 19.43 (Improved) country_of_birth_father_ Taiwan: 31.92 ā 31.92 (Improved) country_of_birth_father_ Thailand: 44.03 ā 44.03 (No change) country_of_birth_father_ Trinadad&Tobago: 45.18 ā 45.18 (Improved) country_of_birth_father_ United-States: -1.46 ā -1.46 (No change) country_of_birth_father_ Vietnam: 20.36 ā 
20.36 (Improved) country_of_birth_father_ Yugoslavia: 35.35 ā 35.35 (Improved) country_of_birth_mother_ ?: 5.41 ā 5.41 (No change) country_of_birth_mother_ Cambodia: 33.75 ā 33.75 (No change) country_of_birth_mother_ Canada: 11.34 ā 11.34 (Improved) country_of_birth_mother_ China: 16.75 ā 16.75 (Improved) country_of_birth_mother_ Columbia: 17.54 ā 17.54 (Improved) country_of_birth_mother_ Cuba: 13.17 ā 13.17 (Improved) country_of_birth_mother_ Dominican-Republic: 13.72 ā 13.72 (Improved) country_of_birth_mother_ Ecuador: 20.94 ā 20.94 (Improved) country_of_birth_mother_ El-Salvador: 13.20 ā 13.20 (Improved) country_of_birth_mother_ England: 13.72 ā 13.72 (Improved) country_of_birth_mother_ France: 29.32 ā 29.32 (No change) country_of_birth_mother_ Germany: 12.50 ā 12.50 (Improved) country_of_birth_mother_ Greece: 25.60 ā 25.60 (No change) country_of_birth_mother_ Guatemala: 20.04 ā 20.04 (No change) country_of_birth_mother_ Haiti: 22.69 ā 22.69 (No change) country_of_birth_mother_ Holand-Netherlands: 65.66 ā 65.66 (No change) country_of_birth_mother_ Honduras: 32.80 ā 32.80 (No change) country_of_birth_mother_ Hong Kong: 44.03 ā 44.03 (No change) country_of_birth_mother_ Hungary: 24.38 ā 24.38 (Improved) country_of_birth_mother_ India: 17.91 ā 17.91 (Improved) country_of_birth_mother_ Iran: 30.00 ā 30.00 (No change) country_of_birth_mother_ Ireland: 20.25 ā 20.25 (No change) country_of_birth_mother_ Italy: 10.17 ā 10.17 (Improved) country_of_birth_mother_ Jamaica: 21.06 ā 21.06 (No change) country_of_birth_mother_ Japan: 20.94 ā 20.94 (Improved) country_of_birth_mother_ Laos: 33.26 ā 33.26 (No change) country_of_birth_mother_ Mexico: 4.16 ā 4.16 (No change) country_of_birth_mother_ Nicaragua: 22.69 ā 22.69 (No change) country_of_birth_mother_ Outlying-U S (Guam USVI etc): 41.05 ā 41.05 (No change) country_of_birth_mother_ Panama: 74.46 ā 74.46 (Improved) country_of_birth_mother_ Peru: 24.97 ā 24.97 (Improved) country_of_birth_mother_ Philippines: 12.17 ā 12.17 (No 
change) country_of_birth_mother_ Poland: 13.17 ā 13.17 (Improved) country_of_birth_mother_ Portugal: 24.01 ā 24.01 (Improved) country_of_birth_mother_ Puerto-Rico: 8.57 ā 8.57 (No change) country_of_birth_mother_ Scotland: 31.11 ā 31.11 (Improved) country_of_birth_mother_ South Korea: 18.46 ā 18.46 (Improved) country_of_birth_mother_ Taiwan: 31.11 ā 31.11 (Improved) country_of_birth_mother_ Thailand: 41.98 ā 41.98 (No change) country_of_birth_mother_ Trinadad&Tobago: 49.23 ā 49.23 (No change) country_of_birth_mother_ United-States: -1.51 ā -1.51 (Improved) country_of_birth_mother_ Vietnam: 19.83 ā 19.83 (Improved) country_of_birth_mother_ Yugoslavia: 36.55 ā 36.55 (No change) country_of_birth_self_ ?: 7.47 ā 7.47 (Improved) country_of_birth_self_ Cambodia: 47.76 ā 47.76 (Improved) country_of_birth_self_ Canada: 16.22 ā 16.22 (No change) country_of_birth_self_ China: 21.56 ā 21.56 (Improved) country_of_birth_self_ Columbia: 20.36 ā 20.36 (Improved) country_of_birth_self_ Cuba: 15.06 ā 15.06 (Improved) country_of_birth_self_ Dominican-Republic: 17.33 ā 17.33 (No change) country_of_birth_self_ Ecuador: 25.60 ā 25.60 (Improved) country_of_birth_self_ El-Salvador: 16.22 ā 16.22 (Improved) country_of_birth_self_ England: 18.46 ā 18.46 (No change) country_of_birth_self_ France: 40.19 ā 40.19 (No change) country_of_birth_self_ Germany: 15.29 ā 15.29 (No change) country_of_birth_self_ Greece: 33.75 ā 33.75 (No change) country_of_birth_self_ Guatemala: 23.16 ā 23.16 (Improved) country_of_birth_self_ Haiti: 27.54 ā 27.54 (Improved) country_of_birth_self_ Holand-Netherlands: 80.43 ā 80.43 (Improved) country_of_birth_self_ Honduras: 39.37 ā 39.37 (No change) country_of_birth_self_ Hong Kong: 47.76 ā 47.76 (Improved) country_of_birth_self_ Hungary: 50.85 ā 50.85 (Improved) country_of_birth_self_ India: 20.70 ā 20.70 (No change) country_of_birth_self_ Iran: 33.75 ā 33.75 (No change) country_of_birth_self_ Ireland: 37.88 ā 37.88 (Improved) country_of_birth_self_ Italy: 21.96 ā 
21.96 (No change) country_of_birth_self_ Jamaica: 25.38 → 25.38 (No change) country_of_birth_self_ Japan: 23.16 → 23.16 (Improved) country_of_birth_self_ Laos: 42.97 → 42.97 (Improved) country_of_birth_self_ Mexico: 5.60 → 5.60 (No change) country_of_birth_self_ Nicaragua: 25.60 → 25.60 (Improved) country_of_birth_self_ Outlying-U S (Guam USVI etc): 44.03 → 44.03 (No change) country_of_birth_self_ Panama: 80.43 → 80.43 (Improved) country_of_birth_self_ Peru: 30.36 → 30.36 (No change) country_of_birth_self_ Philippines: 14.38 → 14.38 (Improved) country_of_birth_self_ Poland: 21.43 → 21.43 (Improved) country_of_birth_self_ Portugal: 36.55 → 36.55 (No change) country_of_birth_self_ Puerto-Rico: 11.52 → 11.52 (No change) country_of_birth_self_ Scotland: 54.63 → 54.63 (Improved) country_of_birth_self_ South Korea: 21.18 → 21.18 (Improved) country_of_birth_self_ Taiwan: 33.75 → 33.75 (No change) country_of_birth_self_ Thailand: 41.05 → 41.05 (No change) country_of_birth_self_ Trinadad&Tobago: 62.29 → 62.29 (Improved) country_of_birth_self_ United-States: -2.40 → -2.40 (Improved) country_of_birth_self_ Vietnam: 23.16 → 23.16 (Improved) country_of_birth_self_ Yugoslavia: 62.29 → 62.29 (Improved) citizenship_ Foreign born- Not a citizen of U S : 3.39 → 3.39 (Improved) citizenship_ Foreign born- U S citizen by naturalization: 5.54 → 5.54 (Improved) citizenship_ Native- Born abroad of American Parent(s): 10.42 → 10.42 (Improved) citizenship_ Native- Born in Puerto Rico or U S Outlying: 11.13 → 11.13 (Improved) citizenship_ Native- Born in the United States: -2.40 → -2.40 (Improved) fill_inc_questionnaire_for_veteran_ No: 11.24 → 11.24 (No change) fill_inc_questionnaire_for_veteran_ Not in universe: -10.10 → -10.10 (Improved) fill_inc_questionnaire_for_veteran_ Yes: 23.49 → 23.49 (No change)
=== Test Data ===
Skewness analysis (threshold: 0.5): Found 394 potentially skewed features
Skewness transformation results:
occupation_code: 0.77 → 0.32 (Improved) wage_per_hour: 8.88 → 3.91 (Improved) capital_gains: 26.15 → 5.89 (Improved) capital_losses: 7.64 → 6.90 (Improved) dividends_from_stocks: 27.41 → 3.32 (Improved) instance_weight: 1.45 → -0.79 (Improved) own_business_or_self_employed: 2.82 → 2.78 (Improved) veterans_benefits: -1.38 → -1.39 (No change) work_experience: 0.74 → -0.39 (Improved) capital_ratio: 26.15 → 5.89 (Improved) full_year_worker: 0.51 → 0.51 (No change) has_capital_gains: 9.56 → 9.56 (Improved) has_capital_losses: 6.88 → 6.88 (Improved) has_dividends: 2.64 → 2.64 (Improved) is_married: -2.08 → -2.08 (Improved) enrolled_in_edu_inst_encoded: 4.15 → 3.92 (Improved) member_of_labor_union_encoded: 3.39 → 3.09 (Improved) fill_inc_questionnaire_for_veteran_encoded: 11.41 → 10.61 (Improved) class_of_worker_ Federal government: 8.15 → 8.15 (Improved) class_of_worker_ Local government: 4.72 → 4.72 (No change) class_of_worker_ Never worked: 21.53 → 21.53 (Improved) class_of_worker_ Private: 0.53 → 0.53 (Improved) class_of_worker_ Self-employed-incorporated: 7.57 → 7.57 (No change) class_of_worker_ Self-employed-not incorporated: 4.42 → 4.42 (No change) class_of_worker_ State government: 6.47 → 6.47 (No change) class_of_worker_ Without pay: 35.58 → 35.58 (Improved) education_ 10th grade: 4.72 → 4.72 (Improved) education_ 11th grade: 4.92 → 4.92 (No change) education_ 12th grade no diploma: 9.02 → 9.02 (Improved) education_ 1st 2nd 3rd or 4th grade: 10.14 → 10.14 (Improved) education_ 5th or 6th grade: 7.27 → 7.27 (No change) education_ 7th and 8th grade: 4.51 → 4.51 (No change) education_ 9th grade: 5.30 → 5.30 (Improved) education_ Associates degree-academic program: 6.59 → 6.59 (No change) education_ Associates degree-occup /vocational: 5.74 → 5.74 (No change) education_ Bachelors degree(BA AB BS): 2.66 → 2.66 (No change) education_ Children: 1.41 → 1.41 (No change) education_ Doctorate degree(PhD EdD): 12.73 → 12.73 (Improved) education_ High school graduate: 1.15 → 1.15 (No change) education_ Less than 1st grade: 15.10 → 15.10 
(Improved) education_ Masters degree(MA MS MEng MEd MSW MBA): 5.19 ā 5.19 (No change) education_ Prof school degree (MD DDS DVM LLB JD): 10.64 ā 10.64 (No change) education_ Some college but no degree: 2.02 ā 2.02 (No change) enrolled_in_edu_inst_ College or university: 5.55 ā 5.55 (Improved) enrolled_in_edu_inst_ High school: 4.96 ā 4.96 (No change) enrolled_in_edu_inst_ Not in universe: -3.50 ā -3.50 (Improved) marital_status_ Divorced: 3.47 ā 3.47 (Improved) marital_status_ Married-A F spouse present: 16.74 ā 16.74 (Improved) marital_status_ Married-spouse absent: 11.45 ā 11.45 (Improved) marital_status_ Separated: 7.33 ā 7.33 (No change) marital_status_ Widowed: 3.91 ā 3.91 (Improved) major_industry_code_ Agriculture: 7.93 ā 7.93 (No change) major_industry_code_ Armed Forces: 79.64 ā 79.64 (Improved) major_industry_code_ Business and repair services: 5.43 ā 5.43 (No change) major_industry_code_ Communications: 12.78 ā 12.78 (Improved) major_industry_code_ Construction: 5.33 ā 5.33 (Improved) major_industry_code_ Education: 4.46 ā 4.46 (No change) major_industry_code_ Entertainment: 10.71 ā 10.71 (No change) major_industry_code_ Finance insurance and real estate: 5.43 ā 5.43 (No change) major_industry_code_ Forestry and fisheries: 30.96 ā 30.96 (Improved) major_industry_code_ Hospital services: 7.04 ā 7.04 (Improved) major_industry_code_ Manufacturing-durable goods: 4.36 ā 4.36 (Improved) major_industry_code_ Manufacturing-nondurable goods: 5.06 ā 5.06 (No change) major_industry_code_ Medical except hospital: 6.25 ā 6.25 (No change) major_industry_code_ Mining: 17.19 ā 17.19 (No change) major_industry_code_ Other professional services: 6.46 ā 6.46 (Improved) major_industry_code_ Personal services except private HH: 7.94 ā 7.94 (No change) major_industry_code_ Private household services: 13.79 ā 13.79 (No change) major_industry_code_ Public administration: 6.45 ā 6.45 (No change) major_industry_code_ Retail trade: 2.85 ā 2.85 (Improved) major_industry_code_ 
Social services: 8.69 ā 8.69 (No change) major_industry_code_ Transportation: 6.41 ā 6.41 (No change) major_industry_code_ Utilities and sanitary services: 13.07 ā 13.07 (Improved) major_industry_code_ Wholesale trade: 7.08 ā 7.08 (Improved) major_occupation_code_ Adm support including clerical: 3.22 ā 3.22 (Improved) major_occupation_code_ Armed Forces: 79.64 ā 79.64 (Improved) major_occupation_code_ Executive admin and managerial: 3.57 ā 3.57 (No change) major_occupation_code_ Farming forestry and fishing: 7.76 ā 7.76 (Improved) major_occupation_code_ Handlers equip cleaners etc : 6.59 ā 6.59 (Improved) major_occupation_code_ Machine operators assmblrs & inspctrs: 5.22 ā 5.22 (Improved) major_occupation_code_ Other service: 3.54 ā 3.54 (Improved) major_occupation_code_ Precision production craft & repair: 3.89 ā 3.89 (Improved) major_occupation_code_ Private household services: 14.88 ā 14.88 (Improved) major_occupation_code_ Professional specialty: 3.37 ā 3.37 (Improved) major_occupation_code_ Protective services: 10.93 ā 10.93 (No change) major_occupation_code_ Sales: 3.63 ā 3.63 (Improved) major_occupation_code_ Technicians and related support: 7.93 ā 7.93 (No change) major_occupation_code_ Transportation and material moving: 6.67 ā 6.67 (Improved) race_ Amer Indian Aleut or Eskimo: 8.78 ā 8.78 (Improved) race_ Asian or Pacific Islander: 5.50 ā 5.50 (Improved) race_ Black: 2.61 ā 2.61 (Improved) race_ Other: 6.90 ā 6.90 (No change) race_ White: -1.80 ā -1.80 (No change) hispanic_origin_ All other: -2.04 ā -2.04 (Improved) hispanic_origin_ Central or South American: 6.72 ā 6.72 (Improved) hispanic_origin_ Chicano: 23.74 ā 23.74 (Improved) hispanic_origin_ Cuban: 12.40 ā 12.40 (No change) hispanic_origin_ Do not know: 26.11 ā 26.11 (Improved) hispanic_origin_ Mexican (Mexicano): 4.81 ā 4.81 (Improved) hispanic_origin_ Mexican-American: 4.64 ā 4.64 (Improved) hispanic_origin_ NA: 15.43 ā 15.43 (Improved) hispanic_origin_ Other Spanish: 8.65 ā 8.65 (No change) 
hispanic_origin_ Puerto Rican: 7.60 ā 7.60 (No change) member_of_labor_union_ No: 3.06 ā 3.06 (Improved) member_of_labor_union_ Not in universe: -2.74 ā -2.74 (Improved) member_of_labor_union_ Yes: 8.19 ā 8.19 (Improved) reason_for_unemployment_ Job leaver: 18.19 ā 18.19 (Improved) reason_for_unemployment_ Job loser - on layoff: 13.59 ā 13.59 (Improved) reason_for_unemployment_ New entrant: 21.53 ā 21.53 (Improved) reason_for_unemployment_ Not in universe: -5.25 ā -5.25 (Improved) reason_for_unemployment_ Other job loser: 9.14 ā 9.14 (Improved) reason_for_unemployment_ Re-entrant: 9.50 ā 9.50 (No change) full_or_part_time_employment_ Full-time schedules: 1.42 ā 1.42 (Improved) full_or_part_time_employment_ Not in labor force: 2.09 ā 2.09 (No change) full_or_part_time_employment_ PT for econ reasons usually FT: 19.13 ā 19.13 (Improved) full_or_part_time_employment_ PT for econ reasons usually PT: 12.81 ā 12.81 (Improved) full_or_part_time_employment_ PT for non-econ reasons usually FT: 7.28 ā 7.28 (Improved) full_or_part_time_employment_ Unemployed full-time: 8.74 ā 8.74 (Improved) full_or_part_time_employment_ Unemployed part- time: 15.19 ā 15.19 (Improved) tax_filer_status_ Head of household: 4.76 ā 4.76 (Improved) tax_filer_status_ Joint both 65+: 4.51 ā 4.51 (Improved) tax_filer_status_ Joint both under 65: 0.64 ā 0.64 (Improved) tax_filer_status_ Joint one under 65 & one 65+: 6.81 ā 6.81 (No change) tax_filer_status_ Nonfiler: 0.60 ā 0.60 (Improved) tax_filer_status_ Single: 1.54 ā 1.54 (No change) region_of_previous_residence_ Abroad: 21.69 ā 21.69 (Improved) region_of_previous_residence_ Midwest: 7.35 ā 7.35 (Improved) region_of_previous_residence_ Northeast: 8.36 ā 8.36 (Improved) region_of_previous_residence_ Not in universe: -3.10 ā -3.10 (Improved) region_of_previous_residence_ South: 6.07 ā 6.07 (No change) region_of_previous_residence_ West: 6.66 ā 6.66 (Improved) state_of_previous_residence_ ?: 16.97 ā 16.97 (No change) state_of_previous_residence_ 
Abroad: 18.95 ā 18.95 (Improved) state_of_previous_residence_ Alabama: 29.78 ā 29.78 (No change) state_of_previous_residence_ Alaska: 28.11 ā 28.11 (No change) state_of_previous_residence_ Arizona: 26.40 ā 26.40 (No change) state_of_previous_residence_ Arkansas: 30.96 ā 30.96 (No change) state_of_previous_residence_ California: 10.32 ā 10.32 (Improved) state_of_previous_residence_ Colorado: 29.10 ā 29.10 (Improved) state_of_previous_residence_ Connecticut: 38.23 ā 38.23 (Improved) state_of_previous_residence_ Delaware: 45.46 ā 45.46 (No change) state_of_previous_residence_ District of Columbia: 41.95 ā 41.95 (Improved) state_of_previous_residence_ Florida: 14.56 ā 14.56 (Improved) state_of_previous_residence_ Georgia: 29.23 ā 29.23 (No change) state_of_previous_residence_ Idaho: 85.55 ā 85.55 (No change) state_of_previous_residence_ Illinois: 33.42 ā 33.42 (Improved) state_of_previous_residence_ Indiana: 17.94 ā 17.94 (Improved) state_of_previous_residence_ Iowa: 38.53 ā 38.53 (No change) state_of_previous_residence_ Kansas: 36.83 ā 36.83 (No change) state_of_previous_residence_ Kentucky: 29.10 ā 29.10 (Improved) state_of_previous_residence_ Louisiana: 28.47 ā 28.47 (No change) state_of_previous_residence_ Maine: 38.53 ā 38.53 (No change) state_of_previous_residence_ Maryland: 34.67 ā 34.67 (No change) state_of_previous_residence_ Massachusetts: 36.07 ā 36.07 (No change) state_of_previous_residence_ Michigan: 21.97 ā 21.97 (Improved) state_of_previous_residence_ Minnesota: 18.23 ā 18.23 (Improved) state_of_previous_residence_ Mississippi: 34.89 ā 34.89 (Improved) state_of_previous_residence_ Missouri: 36.83 ā 36.83 (No change) state_of_previous_residence_ Montana: 30.20 ā 30.20 (Improved) state_of_previous_residence_ Nebraska: 30.96 ā 30.96 (No change) state_of_previous_residence_ Nevada: 35.12 ā 35.12 (No change) state_of_previous_residence_ New Hampshire: 29.23 ā 29.23 (No change) state_of_previous_residence_ New Jersey: 50.02 ā 50.02 (Improved) 
state_of_previous_residence_ New Mexico: 20.59 ā 20.59 (Improved) state_of_previous_residence_ New York: 29.78 ā 29.78 (No change) state_of_previous_residence_ North Carolina: 15.47 ā 15.47 (Improved) state_of_previous_residence_ North Dakota: 20.63 ā 20.63 (No change) state_of_previous_residence_ Not in universe: -3.10 ā -3.10 (Improved) state_of_previous_residence_ Ohio: 30.96 ā 30.96 (No change) state_of_previous_residence_ Oklahoma: 17.85 ā 17.85 (No change) state_of_previous_residence_ Oregon: 29.64 ā 29.64 (No change) state_of_previous_residence_ Pennsylvania: 31.44 ā 31.44 (No change) state_of_previous_residence_ South Carolina: 44.04 ā 44.04 (No change) state_of_previous_residence_ South Dakota: 36.07 ā 36.07 (No change) state_of_previous_residence_ Tennessee: 31.77 ā 31.77 (No change) state_of_previous_residence_ Texas: 31.77 ā 31.77 (No change) state_of_previous_residence_ Utah: 13.30 ā 13.30 (Improved) state_of_previous_residence_ Vermont: 30.96 ā 30.96 (Improved) state_of_previous_residence_ Virginia: 40.47 ā 40.47 (Improved) state_of_previous_residence_ West Virginia: 30.50 ā 30.50 (No change) state_of_previous_residence_ Wisconsin: 39.14 ā 39.14 (Improved) state_of_previous_residence_ Wyoming: 30.80 ā 30.80 (Improved) detailed_household_summary_ Child 18+ ever marr Not in a subfamily: 14.25 ā 14.25 (No change) detailed_household_summary_ Child 18+ ever marr RP of subfamily: 16.31 ā 16.31 (No change) detailed_household_summary_ Child 18+ never marr Not in a subfamily: 3.56 ā 3.56 (Improved) detailed_household_summary_ Child 18+ never marr RP of subfamily: 17.52 ā 17.52 (No change) detailed_household_summary_ Child 18+ spouse of subfamily RP: 35.12 ā 35.12 (No change) detailed_household_summary_ Child <18 ever marr RP of subfamily: 125.94 ā 125.94 (Improved) detailed_household_summary_ Child <18 ever marr not in subfamily: 79.64 ā 79.64 (Improved) detailed_household_summary_ Child <18 never marr RP of subfamily: 49.37 ā 49.37 (Improved) 
detailed_household_summary_ Child <18 never marr not in subfamily: 1.30 ā 1.30 (No change) detailed_household_summary_ Child <18 spouse of subfamily RP: 308.51 ā 308.51 (Improved) detailed_household_summary_ Child under 18 of RP of unrel subfamily: 16.42 ā 16.42 (Improved) detailed_household_summary_ Grandchild 18+ ever marr RP of subfamily: 125.94 ā 125.94 (Improved) detailed_household_summary_ Grandchild 18+ ever marr not in subfamily: 70.76 ā 70.76 (No change) detailed_household_summary_ Grandchild 18+ never marr RP of subfamily: 308.51 ā 308.51 (Improved) detailed_household_summary_ Grandchild 18+ never marr not in subfamily: 21.37 ā 21.37 (No change) detailed_household_summary_ Grandchild 18+ spouse of subfamily RP: 154.25 ā 154.25 (Improved) detailed_household_summary_ Grandchild <18 never marr RP of subfamily: 308.51 ā 308.51 (No change) detailed_household_summary_ Grandchild <18 never marr child of subfamily RP: 10.38 ā 10.38 (Improved) detailed_household_summary_ Grandchild <18 never marr not in subfamily: 13.74 ā 13.74 (Improved) detailed_household_summary_ Householder: 1.02 ā 1.02 (No change) detailed_household_summary_ In group quarters: 34.45 ā 34.45 (No change) detailed_household_summary_ Nonfamily householder: 2.41 ā 2.41 (No change) detailed_household_summary_ Other Rel 18+ ever marr RP of subfamily: 17.49 ā 17.49 (Improved) detailed_household_summary_ Other Rel 18+ ever marr not in subfamily: 9.54 ā 9.54 (No change) detailed_household_summary_ Other Rel 18+ never marr RP of subfamily: 38.83 ā 38.83 (No change) detailed_household_summary_ Other Rel 18+ never marr not in subfamily: 10.56 ā 10.56 (Improved) detailed_household_summary_ Other Rel 18+ spouse of subfamily RP: 16.31 ā 16.31 (Improved) detailed_household_summary_ Other Rel <18 ever marr RP of subfamily: 178.11 ā 178.11 (No change) detailed_household_summary_ Other Rel <18 ever marr not in subfamily: 218.15 ā 218.15 (Improved) detailed_household_summary_ Other Rel <18 never marr child of 
subfamily RP: 17.82 ā 17.82 (Improved) detailed_household_summary_ Other Rel <18 never marr not in subfamily: 17.88 ā 17.88 (Improved) detailed_household_summary_ Other Rel <18 never married RP of subfamily: 218.15 ā 218.15 (Improved) detailed_household_summary_ Other Rel <18 spouse of subfamily RP: 218.15 ā 218.15 (Improved) detailed_household_summary_ RP of unrelated subfamily: 16.67 ā 16.67 (Improved) detailed_household_summary_ Secondary individual: 5.27 ā 5.27 (No change) detailed_household_summary_ Spouse of RP of unrelated subfamily: 62.95 ā 62.95 (Improved) detailed_household_summary_ Spouse of householder: 1.37 ā 1.37 (Improved) detailed_household_summary_in_household_ Child 18 or older: 3.18 ā 3.18 (No change) detailed_household_summary_in_household_ Child under 18 ever married: 65.75 ā 65.75 (Improved) detailed_household_summary_in_household_ Child under 18 never married: 1.30 ā 1.30 (Improved) detailed_household_summary_in_household_ Group Quarters- Secondary individual: 42.34 ā 42.34 (No change) detailed_household_summary_in_household_ Nonrelative of householder: 4.69 ā 4.69 (No change) detailed_household_summary_in_household_ Other relative of householder: 4.13 ā 4.13 (No change) detailed_household_summary_in_household_ Spouse of householder: 1.37 ā 1.37 (Improved) migration_code_change_in_msa_ Abroad to MSA: 24.03 ā 24.03 (No change) migration_code_change_in_msa_ Abroad to nonMSA: 51.39 ā 51.39 (Improved) migration_code_change_in_msa_ MSA to MSA: 3.92 ā 3.92 (No change) migration_code_change_in_msa_ MSA to nonMSA: 16.12 ā 16.12 (No change) migration_code_change_in_msa_ NonMSA to MSA: 17.94 ā 17.94 (Improved) migration_code_change_in_msa_ NonMSA to nonMSA: 8.28 ā 8.28 (Improved) migration_code_change_in_msa_ Not identifiable: 21.80 ā 21.80 (No change) migration_code_change_in_msa_ Not in universe: 12.40 ā 12.40 (No change) migration_code_change_in_reg_ Abroad: 21.69 ā 21.69 (Improved) migration_code_change_in_reg_ Different county same state: 8.35 ā 
8.35 (Improved) migration_code_change_in_reg_ Different division same region: 20.59 ā 20.59 (Improved) migration_code_change_in_reg_ Different region: 12.70 ā 12.70 (Improved) migration_code_change_in_reg_ Different state same division: 14.16 ā 14.16 (Improved) migration_code_change_in_reg_ Not in universe: 12.40 ā 12.40 (No change) migration_code_change_in_reg_ Same county: 4.11 ā 4.11 (No change) migration_code_move_within_reg_ Abroad: 21.69 ā 21.69 (Improved) migration_code_move_within_reg_ Different county same state: 8.35 ā 8.35 (Improved) migration_code_move_within_reg_ Different state in Midwest: 19.84 ā 19.84 (Improved) migration_code_move_within_reg_ Different state in Northeast: 21.43 ā 21.43 (No change) migration_code_move_within_reg_ Different state in South: 14.11 ā 14.11 (Improved) migration_code_move_within_reg_ Different state in West: 16.33 ā 16.33 (No change) migration_code_move_within_reg_ Not in universe: 12.40 ā 12.40 (No change) migration_code_move_within_reg_ Same county: 4.11 ā 4.11 (No change) live_in_this_house_1_year_ago_ No: 3.10 ā 3.10 (Improved) migration_prev_res_in_sunbelt_ No: 4.14 ā 4.14 (No change) migration_prev_res_in_sunbelt_ Yes: 5.52 ā 5.52 (No change) family_members_under_18_ Both parents present: 1.75 ā 1.75 (No change) family_members_under_18_ Father only present: 9.98 ā 9.98 (Improved) family_members_under_18_ Mother only present: 3.62 ā 3.62 (Improved) family_members_under_18_ Neither parent present: 10.81 ā 10.81 (Improved) family_members_under_18_ Not in universe: -1.15 ā -1.15 (Improved) country_of_birth_father_ ?: 5.04 ā 5.04 (No change) country_of_birth_father_ Cambodia: 28.59 ā 28.59 (Improved) country_of_birth_father_ Canada: 12.00 ā 12.00 (Improved) country_of_birth_father_ China: 15.51 ā 15.51 (Improved) country_of_birth_father_ Columbia: 18.03 ā 18.03 (No change) country_of_birth_father_ Cuba: 12.68 ā 12.68 (No change) country_of_birth_father_ Dominican-Republic: 11.74 ā 11.74 (Improved) 
country_of_birth_father_ Ecuador: 22.20 ā 22.20 (No change) country_of_birth_father_ El-Salvador: 13.72 ā 13.72 (Improved) country_of_birth_father_ England: 15.90 ā 15.90 (No change) country_of_birth_father_ France: 33.82 ā 33.82 (Improved) country_of_birth_father_ Germany: 12.11 ā 12.11 (Improved) country_of_birth_father_ Greece: 21.75 ā 21.75 (Improved) country_of_birth_father_ Guatemala: 20.36 ā 20.36 (Improved) country_of_birth_father_ Haiti: 24.72 ā 24.72 (Improved) country_of_birth_father_ Holand-Netherlands: 67.30 ā 67.30 (No change) country_of_birth_father_ Honduras: 30.06 ā 30.06 (Improved) country_of_birth_father_ Hong Kong: 44.04 ā 44.04 (No change) country_of_birth_father_ Hungary: 23.96 ā 23.96 (No change) country_of_birth_father_ India: 17.76 ā 17.76 (Improved) country_of_birth_father_ Iran: 30.35 ā 30.35 (No change) country_of_birth_father_ Ireland: 17.70 ā 17.70 (Improved) country_of_birth_father_ Italy: 9.15 ā 9.15 (Improved) country_of_birth_father_ Jamaica: 19.47 ā 19.47 (No change) country_of_birth_father_ Japan: 21.43 ā 21.43 (Improved) country_of_birth_father_ Laos: 34.45 ā 34.45 (Improved) country_of_birth_father_ Mexico: 4.04 ā 4.04 (No change) country_of_birth_father_ Nicaragua: 23.19 ā 23.19 (Improved) country_of_birth_father_ Outlying-U S (Guam USVI etc): 35.58 ā 35.58 (No change) country_of_birth_father_ Panama: 77.11 ā 77.11 (Improved) country_of_birth_father_ Peru: 24.56 ā 24.56 (Improved) country_of_birth_father_ Philippines: 12.62 ā 12.62 (Improved) country_of_birth_father_ Poland: 12.33 ā 12.33 (Improved) country_of_birth_father_ Portugal: 22.03 ā 22.03 (Improved) country_of_birth_father_ Puerto-Rico: 8.46 ā 8.46 (No change) country_of_birth_father_ Scotland: 28.59 ā 28.59 (Improved) country_of_birth_father_ South Korea: 19.28 ā 19.28 (No change) country_of_birth_father_ Taiwan: 34.03 ā 34.03 (Improved) country_of_birth_father_ Thailand: 41.95 ā 41.95 (Improved) country_of_birth_father_ Trinadad&Tobago: 37.65 ā 37.65 (Improved) 
country_of_birth_father_ United-States: -1.42 ā -1.42 (No change) country_of_birth_father_ Vietnam: 20.92 ā 20.92 (Improved) country_of_birth_father_ Yugoslavia: 28.11 ā 28.11 (No change) country_of_birth_mother_ ?: 5.35 ā 5.35 (No change) country_of_birth_mother_ Cambodia: 30.80 ā 30.80 (Improved) country_of_birth_mother_ Canada: 11.65 ā 11.65 (No change) country_of_birth_mother_ China: 16.10 ā 16.10 (Improved) country_of_birth_mother_ Columbia: 18.07 ā 18.07 (Improved) country_of_birth_mother_ Cuba: 12.58 ā 12.58 (Improved) country_of_birth_mother_ Dominican-Republic: 13.25 ā 13.25 (Improved) country_of_birth_mother_ Ecuador: 21.75 ā 21.75 (Improved) country_of_birth_mother_ El-Salvador: 13.26 ā 13.26 (Improved) country_of_birth_mother_ England: 14.99 ā 14.99 (Improved) country_of_birth_mother_ France: 33.03 ā 33.03 (No change) country_of_birth_mother_ Germany: 11.82 ā 11.82 (No change) country_of_birth_mother_ Greece: 24.96 ā 24.96 (Improved) country_of_birth_mother_ Guatemala: 20.40 ā 20.40 (No change) country_of_birth_mother_ Haiti: 24.33 ā 24.33 (Improved) country_of_birth_mother_ Holand-Netherlands: 68.96 ā 68.96 (No change) country_of_birth_mother_ Honduras: 29.10 ā 29.10 (Improved) country_of_birth_mother_ Hong Kong: 44.97 ā 44.97 (No change) country_of_birth_mother_ Hungary: 23.81 ā 23.81 (No change) country_of_birth_mother_ India: 17.64 ā 17.64 (No change) country_of_birth_mother_ Iran: 32.30 ā 32.30 (No change) country_of_birth_mother_ Ireland: 16.84 ā 16.84 (Improved) country_of_birth_mother_ Italy: 10.14 ā 10.14 (No change) country_of_birth_mother_ Jamaica: 19.47 ā 19.47 (No change) country_of_birth_mother_ Japan: 18.98 ā 18.98 (Improved) country_of_birth_mother_ Laos: 36.57 ā 36.57 (No change) country_of_birth_mother_ Mexico: 4.06 ā 4.06 (Improved) country_of_birth_mother_ Nicaragua: 22.74 ā 22.74 (Improved) country_of_birth_mother_ Outlying-U S (Guam USVI etc): 39.46 ā 39.46 (Improved) country_of_birth_mother_ Panama: 72.70 ā 72.70 (Improved) 
country_of_birth_mother_ Peru: 23.96 ā 23.96 (Improved) country_of_birth_mother_ Philippines: 12.07 ā 12.07 (Improved) country_of_birth_mother_ Poland: 12.74 ā 12.74 (Improved) country_of_birth_mother_ Portugal: 23.06 ā 23.06 (Improved) country_of_birth_mother_ Puerto-Rico: 8.79 ā 8.79 (Improved) country_of_birth_mother_ Scotland: 29.10 ā 29.10 (No change) country_of_birth_mother_ South Korea: 17.91 ā 17.91 (Improved) country_of_birth_mother_ Taiwan: 30.50 ā 30.50 (Improved) country_of_birth_mother_ Thailand: 36.83 ā 36.83 (No change) country_of_birth_mother_ Trinadad&Tobago: 41.19 ā 41.19 (No change) country_of_birth_mother_ United-States: -1.47 ā -1.47 (No change) country_of_birth_mother_ Vietnam: 20.14 ā 20.14 (No change) country_of_birth_mother_ Yugoslavia: 30.80 ā 30.80 (Improved) country_of_birth_self_ ?: 7.19 ā 7.19 (Improved) country_of_birth_self_ Cambodia: 38.83 ā 38.83 (No change) country_of_birth_self_ Canada: 16.67 ā 16.67 (Improved) country_of_birth_self_ China: 20.01 ā 20.01 (Improved) country_of_birth_self_ Columbia: 21.17 ā 21.17 (Improved) country_of_birth_self_ Cuba: 14.86 ā 14.86 (Improved) country_of_birth_self_ Dominican-Republic: 17.03 ā 17.03 (No change) country_of_birth_self_ Ecuador: 26.70 ā 26.70 (Improved) country_of_birth_self_ El-Salvador: 16.59 ā 16.59 (Improved) country_of_birth_self_ England: 20.63 ā 20.63 (No change) country_of_birth_self_ France: 40.83 ā 40.83 (No change) country_of_birth_self_ Germany: 15.06 ā 15.06 (Improved) country_of_birth_self_ Greece: 36.57 ā 36.57 (Improved) country_of_birth_self_ Guatemala: 23.96 ā 23.96 (No change) country_of_birth_self_ Haiti: 30.96 ā 30.96 (No change) country_of_birth_self_ Holand-Netherlands: 93.01 ā 93.01 (Improved) country_of_birth_self_ Honduras: 35.12 ā 35.12 (No change) country_of_birth_self_ Hong Kong: 42.75 ā 42.75 (Improved) country_of_birth_self_ Hungary: 56.30 ā 56.30 (No change) country_of_birth_self_ India: 21.43 ā 21.43 (No change) country_of_birth_self_ Iran: 37.94 ā 
37.94 (Improved) country_of_birth_self_ Ireland: 36.83 ā 36.83 (No change) country_of_birth_self_ Italy: 20.54 ā 20.54 (Improved) country_of_birth_self_ Jamaica: 23.19 ā 23.19 (Improved) country_of_birth_self_ Japan: 23.06 ā 23.06 (Improved) country_of_birth_self_ Laos: 41.95 ā 41.95 (Improved) country_of_birth_self_ Mexico: 5.45 ā 5.45 (Improved) country_of_birth_self_ Nicaragua: 27.00 ā 27.00 (Improved) country_of_birth_self_ Outlying-U S (Guam USVI etc): 45.46 ā 45.46 (No change) country_of_birth_self_ Panama: 93.01 ā 93.01 (Improved) country_of_birth_self_ Peru: 28.11 ā 28.11 (Improved) country_of_birth_self_ Philippines: 14.41 ā 14.41 (No change) country_of_birth_self_ Poland: 23.60 ā 23.60 (No change) country_of_birth_self_ Portugal: 31.77 ā 31.77 (Improved) country_of_birth_self_ Puerto-Rico: 11.62 ā 11.62 (Improved) country_of_birth_self_ Scotland: 52.12 ā 52.12 (No change) country_of_birth_self_ South Korea: 20.09 ā 20.09 (Improved) country_of_birth_self_ Taiwan: 34.45 ā 34.45 (Improved) country_of_birth_self_ Thailand: 38.83 ā 38.83 (No change) country_of_birth_self_ Trinadad&Tobago: 45.46 ā 45.46 (No change) country_of_birth_self_ United-States: -2.36 ā -2.36 (No change) country_of_birth_self_ Vietnam: 22.93 ā 22.93 (Improved) country_of_birth_self_ Yugoslavia: 44.97 ā 44.97 (No change) citizenship_ Foreign born- Not a citizen of U S : 3.37 ā 3.37 (Improved) citizenship_ Foreign born- U S citizen by naturalization: 5.38 ā 5.38 (No change) citizenship_ Native- Born abroad of American Parent(s): 9.95 ā 9.95 (No change) citizenship_ Native- Born in Puerto Rico or U S Outlying: 11.24 ā 11.24 (Improved) citizenship_ Native- Born in the United States: -2.36 ā -2.36 (Improved) fill_inc_questionnaire_for_veteran_ No: 10.71 ā 10.71 (No change) fill_inc_questionnaire_for_veteran_ Not in universe: -9.58 ā -9.58 (No change) fill_inc_questionnaire_for_veteran_ Yes: 21.97 ā 21.97 (No change)
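In the report above, the one-hot indicator columns show identical skewness before and after the transform. That is expected: any increasing transform of a two-valued column leaves its skewness unchanged, so only the continuous features can benefit. A minimal sketch of restricting the skewness check to non-binary columns follows. The helper `skewed_continuous_columns`, the toy data, and the use of `np.log1p` are assumptions for illustration, not the notebook's own transform.

```python
import numpy as np
import pandas as pd

def skewed_continuous_columns(df, threshold=0.5):
    """Return non-binary numeric columns whose |skewness| exceeds threshold.

    Hypothetical helper for illustration; not part of the notebook.
    """
    numeric = df.select_dtypes(include="number")
    # Columns with at most two distinct values are indicators; skip them.
    non_binary = [c for c in numeric.columns if numeric[c].nunique() > 2]
    skew = numeric[non_binary].skew()
    return skew[skew.abs() > threshold].index.tolist()

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "capital_gains": rng.lognormal(mean=0.0, sigma=1.5, size=1000),  # heavy right tail
    "age": rng.normal(loc=40, scale=10, size=1000),                  # roughly symmetric
    "is_rare_category": (rng.random(1000) < 0.02).astype(int),       # skewed 0/1 dummy
})

cols = skewed_continuous_columns(demo)
before = demo["capital_gains"].skew()
after = np.log1p(demo["capital_gains"]).skew()
print(cols)
print(f"capital_gains skewness: {before:.2f} -> {after:.2f}")
```

Filtering this way would shorten the report to the features a transform can actually change and avoid labelling unchanged dummy columns as improved.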
4.9 Scale/Normalize
# Function to describe data before and after scaling
def print_scaling_info(data_before, data_after, dataset_name=""):
    """Print statistical information about scaling effects"""
    print(f"\n{dataset_name} Scaling Results:")

    # Convert to numpy if needed
    if isinstance(data_after, pd.DataFrame):
        data_after = data_after.values
    if isinstance(data_before, pd.DataFrame):
        data_before = data_before.values

    # Calculate statistics
    before_mean = np.mean(data_before)
    before_std = np.std(data_before)
    before_min = np.min(data_before)
    before_max = np.max(data_before)
    after_mean = np.mean(data_after)
    after_std = np.std(data_after)
    after_min = np.min(data_after)
    after_max = np.max(data_after)

    # Print comparison
    print(f" Before scaling: mean={before_mean:.4f}, std={before_std:.4f}, min={before_min:.4f}, max={before_max:.4f}")
    print(f" After scaling: mean={after_mean:.4f}, std={after_std:.4f}, min={after_min:.4f}, max={after_max:.4f}")

# Apply Robust Scaling
print("\nApplying RobustScaler to handle outliers and create uniform feature scales...")
scaler = RobustScaler()

# Fit scaler on training data
X_train_scaled = scaler.fit_transform(X_train_skew)

# Apply same transformation to validation and test data
X_val_scaled = scaler.transform(X_val_skew)
X_test_scaled = scaler.transform(X_test_skew)

# Print information about the scaling results
print_scaling_info(X_train_skew, X_train_scaled, "Training Data")
print_scaling_info(X_val_skew, X_val_scaled, "Validation Data")
print_scaling_info(X_test_skew, X_test_scaled, "Test Data")

# Convert back to DataFrames to preserve column names
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_skew.columns, index=X_train_skew.index)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X_val_skew.columns, index=X_val_skew.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_skew.columns, index=X_test_skew.index)

print("\nScaled dataset shapes:")
print(f" X_train_scaled: {X_train_scaled_df.shape}")
print(f" X_val_scaled: {X_val_scaled_df.shape}")
print(f" X_test_scaled: {X_test_scaled_df.shape}")
Applying RobustScaler to handle outliers and create uniform feature scales...

Training Data Scaling Results:
 Before scaling: mean=0.8399, std=5.2905, min=-1.0000, max=95.0000
 After scaling: mean=0.0302, std=0.2654, min=-5.0888, max=10.8198

Validation Data Scaling Results:
 Before scaling: mean=0.8567, std=5.4131, min=-1.0000, max=95.0000
 After scaling: mean=0.0240, std=0.6453, min=-5.1320, max=17.5763

Test Data Scaling Results:
 Before scaling: mean=0.8842, std=5.4257, min=-1.0000, max=95.0000
 After scaling: mean=0.0484, std=0.6541, min=-4.9534, max=17.5763

Scaled dataset shapes:
 X_train_scaled: (155305, 420)
 X_val_scaled: (38829, 420)
 X_test_scaled: (95180, 420)
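To make the scaled ranges above easier to interpret, here is a minimal sketch of what RobustScaler computes with its default settings: each column is centered on its training median and divided by its training interquartile range. The toy arrays `X_train` and `X_new` are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy single-feature data; 100.0 plays the role of an outlier.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_new = np.array([[2.5], [200.0]])

# Defaults: center on the median, scale by the 25th-75th percentile range.
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reproduce the same result by hand from the training statistics.
median = np.median(X_train, axis=0)
q25, q75 = np.percentile(X_train, [25, 75], axis=0)
manual = (X_train - median) / (q75 - q25)
print(np.allclose(X_train_scaled, manual))

# New data is transformed with the statistics learned from the training set.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled.ravel())
```

Because the median and IQR are barely affected by extreme values, outliers do not compress the bulk of the data, but they also stay large after scaling. This is consistent with the unbounded minimum and maximum values reported above.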
4.10 Dimensionality Reduction
def apply_pca(X_train_scaled, X_val_scaled, X_test_scaled, variance_threshold=0.95):
    """
    Apply PCA to scaled data and generate a scree plot

    Args:
        X_train_scaled, X_val_scaled, X_test_scaled: Scaled numpy arrays or DataFrames
        variance_threshold: Desired explained variance (default 0.95)

    Returns:
        PCA-transformed train, validation, and test DataFrames, plus the fitted PCA object
    """
    # Convert to numpy arrays if DataFrames
    if isinstance(X_train_scaled, pd.DataFrame):
        # Save the indices before conversion
        train_index = X_train_scaled.index
        val_index = X_val_scaled.index
        test_index = X_test_scaled.index
        # Convert to numpy for PCA
        X_train_scaled_np = X_train_scaled.values
        X_val_scaled_np = X_val_scaled.values
        X_test_scaled_np = X_test_scaled.values
    else:
        # Already numpy arrays, so fall back to positional indices
        train_index = np.arange(X_train_scaled.shape[0])
        val_index = np.arange(X_val_scaled.shape[0])
        test_index = np.arange(X_test_scaled.shape[0])
        X_train_scaled_np = X_train_scaled
        X_val_scaled_np = X_val_scaled
        X_test_scaled_np = X_test_scaled

    # Fit PCA only on training data
    pca = PCA(n_components=variance_threshold)
    X_train_pca = pca.fit_transform(X_train_scaled_np)
    X_val_pca = pca.transform(X_val_scaled_np)
    X_test_pca = pca.transform(X_test_scaled_np)

    # Create PCA DataFrames
    pca_cols = [f'PC{i+1}' for i in range(X_train_pca.shape[1])]
    X_train_pca_df = pd.DataFrame(X_train_pca, columns=pca_cols, index=train_index)
    X_val_pca_df = pd.DataFrame(X_val_pca, columns=pca_cols, index=val_index)
    X_test_pca_df = pd.DataFrame(X_test_pca, columns=pca_cols, index=test_index)

    # Generate scree plot
    plt.figure(figsize=(10, 6))

    # Get explained variance
    explained_variance = pca.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)
    n_components = len(explained_variance)

    # Bar chart for individual explained variance
    plt.bar(range(1, n_components+1), explained_variance, alpha=0.7,
            align='center', label='Individual explained variance', color='skyblue')

    # Line plot for cumulative explained variance
    plt.step(range(1, n_components+1), cumulative_variance, where='mid',
             label='Cumulative explained variance', color='red', linewidth=2)

    # Add a horizontal line at variance threshold
    plt.axhline(y=variance_threshold, color='green', linestyle='--',
                label=f'Variance threshold: {variance_threshold}')

    # Annotate key points
    for i, (ev, cv) in enumerate(zip(explained_variance, cumulative_variance)):
        if i == 0 or i == n_components-1:  # Always label first and last components
            plt.text(i+1, cv+0.02, f'{cv:.2f}', ha='center', color='darkred', fontweight='bold')
        elif cv >= variance_threshold and cv-explained_variance[i] < variance_threshold:
            # Label the component that crosses the threshold
            plt.text(i+1, cv+0.02, f'{cv:.2f}', ha='center', color='darkred', fontweight='bold')
            plt.axvline(x=i+1, color='green', linestyle='--', alpha=0.3)

    # Add labels and title
    plt.ylabel('Explained Variance Ratio', fontsize=12)
    plt.xlabel('Principal Component', fontsize=12)
    plt.title('PCA Scree Plot: Explained Variance by Component', fontsize=14)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()

    # Print summary statistics
    print(f"Explained variance ratio: {np.sum(pca.explained_variance_ratio_):.4f}")
    print(f"Original number of features: {X_train_scaled_np.shape[1]}")
    print(f"Features after PCA: {X_train_pca_df.shape[1]}")

    # Find how many components needed for threshold
    components_for_threshold = np.where(cumulative_variance >= variance_threshold)[0][0] + 1
    print(f"Components needed for {variance_threshold*100:.0f}% variance: {components_for_threshold}")

    # Show plot
    plt.show()

    return X_train_pca_df, X_val_pca_df, X_test_pca_df, pca

# Apply PCA to the scaled data.
# To skip PCA, comment out this call and switch the assignments below.
X_train_pca, X_val_pca, X_test_pca, pca = apply_pca(X_train_scaled_df, X_val_scaled_df, X_test_scaled_df)

# Set these variables for next steps - allowing flexibility to use or skip PCA
# If PCA is used:
X_train_processed = X_train_pca
X_val_processed = X_val_pca
X_test_processed = X_test_pca

# If PCA is skipped (commented out):
# X_train_processed = X_train_scaled_df
# X_val_processed = X_val_scaled_df
# X_test_processed = X_test_scaled_df
Explained variance ratio: 0.9501
Original number of features: 420
Features after PCA: 40
Components needed for 95% variance: 40
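The component count reported above comes from finding the first component at which the cumulative explained variance reaches the threshold. Here is a minimal sketch of that lookup on synthetic data. The array `X_demo` and its latent-factor construction are illustrative assumptions, not the census features. It also shows that passing a float to `PCA(n_components=...)` lets scikit-learn pick the same count on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in (assumption): 20 features driven by 5 latent factors plus noise
latent = rng.normal(size=(500, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))
X_demo = StandardScaler().fit_transform(X_demo)

variance_threshold = 0.95

# Fit a full PCA and find the first component whose cumulative variance reaches the threshold
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
k_lookup = int(np.where(cumulative >= variance_threshold)[0][0]) + 1

# A float in (0, 1) asks scikit-learn to keep just enough components for that variance
auto_pca = PCA(n_components=variance_threshold).fit(X_demo)
k_auto = int(auto_pca.n_components_)

print(f"Components from cumulative lookup: {k_lookup}")
print(f"Components chosen by PCA(n_components={variance_threshold}): {k_auto}")
```

Both approaches should agree, so the manual lookup is mainly useful for annotating the scree plot.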
4.11 Remove Multicollinearity¶
def remove_multicollinearity(X_train, X_val, X_test, threshold=0.75):
"""
Remove highly correlated features
This function works with either DataFrames or numpy arrays
"""
# Convert to DataFrame if numpy arrays
is_numpy = isinstance(X_train, np.ndarray)
if is_numpy:
X_train = pd.DataFrame(X_train)
X_val = pd.DataFrame(X_val)
X_test = pd.DataFrame(X_test)
# Calculate correlation matrix
corr_matrix = X_train.corr().abs()
# Find high correlation pairs
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr_pairs = []
# Detailed correlation reporting
print("\nFeature correlations above threshold:")
for i in range(len(corr_matrix.columns)):
for j in range(i):
if corr_matrix.iloc[i, j] > threshold:
col1 = corr_matrix.columns[i]
col2 = corr_matrix.columns[j]
corr_value = corr_matrix.iloc[i, j]
high_corr_pairs.append((col1, col2, corr_value))
print(f"• {col1} & {col2}: {corr_value:.3f}")
# Identify columns to drop
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
# Remove features if any to drop
if to_drop:
print(f"\nDropping {len(to_drop)} features due to multicollinearity:")
print(", ".join(to_drop))
X_train = X_train.drop(to_drop, axis=1)
X_val = X_val.drop(to_drop, axis=1)
X_test = X_test.drop(to_drop, axis=1)
else:
print("\nNo features meet correlation threshold for removal")
# Return in original format
if is_numpy:
return X_train.values, X_val.values, X_test.values
return X_train, X_val, X_test
# To use the function with your datasets
X_train_final, X_val_final, X_test_final = remove_multicollinearity(
X_train_processed, X_val_processed, X_test_processed, threshold=0.85
)
# print(X_train_final.shape, X_val_final.shape,X_test_final.shape)
Feature correlations above threshold:

No features meet correlation threshold for removal
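No pairs are reported here, which is expected when the inputs are principal components: on the rows PCA was fitted to, the components are uncorrelated by construction. The sketch below illustrates this on synthetic data. The frame `X_demo` and its column names are hypothetical, and it assumes the correlation filter sees the same rows PCA was fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 3))

# Hypothetical features: "a"/"b" and "c"/"d" are strongly correlated pairs
X_demo = pd.DataFrame({
    "a": base[:, 0],
    "b": base[:, 0] + 0.05 * base[:, 1],
    "c": base[:, 2],
    "d": base[:, 2] - 0.05 * base[:, 1],
})

# Correlation between raw features is high for the constructed pairs
raw_corr = X_demo.corr().abs()

# After PCA, the off-diagonal correlations collapse to (numerically) zero
pcs = pd.DataFrame(
    PCA(n_components=3).fit_transform(X_demo), columns=["PC1", "PC2", "PC3"]
)
pc_corr = pcs.corr().abs()
off_diag = pc_corr.values[~np.eye(3, dtype=bool)]

print(f"Raw |corr(a, b)|: {raw_corr.loc['a', 'b']:.3f}")
print(f"Largest |corr| between components: {off_diag.max():.2e}")
```

In other words, the multicollinearity step mostly matters when PCA is skipped and the scaled features are used directly.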
5. Modeling¶
Let's implement various machine learning models and evaluate their performance.
The choice of metrics depends on whether your classification problem is binary, multi-class, or imbalanced. Here are the most common evaluation metrics to compare your models:
General Metrics for Classification¶
| Metric | Description | Best for |
|---|---|---|
| Accuracy | Percentage of correct predictions. | Balanced datasets with equal class distribution. |
| Precision | Measures how many predicted positives are actually positive (TP / (TP + FP)). | When false positives are costly (e.g., fraud detection). |
| Recall (Sensitivity) | Measures how many actual positives were correctly predicted (TP / (TP + FN)). | When false negatives are costly (e.g., medical diagnosis). |
| F1-Score | Harmonic mean of precision and recall. Balances both metrics. | Imbalanced datasets. |
| ROC-AUC (Receiver Operating Characteristic - Area Under Curve) | Measures how well a model distinguishes between classes. | Binary classification, imbalanced datasets. |
| PR-AUC (Precision-Recall AUC) | Measures precision vs. recall trade-off. | Imbalanced datasets. |
| Log Loss (Cross-Entropy Loss) | Measures the uncertainty of the model's predictions. | Probabilistic classification. |
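To make the formulas in the table concrete, here is a small worked example. The confusion-matrix counts are made up for illustration and are not results from this project. It also shows why accuracy alone can mislead on an imbalanced target: a model that always predicts the majority class scores higher accuracy than one that finds half of the positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical counts (assumption): 60 positives out of 1000 samples
tp, fp, fn, tn = 30, 40, 30, 900

# Expand the counts into label arrays so the scikit-learn functions can be compared
y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

# Metrics computed directly from the definitions in the table
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Baseline that predicts the negative class for everyone
baseline_accuracy = accuracy_score(y_true, np.zeros_like(y_true))

print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f} vs all-negative baseline: {baseline_accuracy:.4f}")
```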
Computational Performance Metrics¶
| Metric | Description | Best for |
|---|---|---|
| Training Time | Measures how long the model takes to train. | Large datasets. |
| Inference Time | Measures how fast the model predicts new data. | Real-time applications. |
| Model Size | How much memory the model consumes. | When deployment constraints exist. |
How to Compare Models?¶
- Train all models on the same dataset.
- Use cross-validation to reduce the variance of the performance estimates.
- Record performance metrics (F1-score, ROC-AUC, etc.).
- Compare training time and inference speed if necessary.
- Pick the best model based on the most relevant metric for your application.
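The steps above can be sketched with `cross_validate`, which scores several metrics on the same folds and records fit times as well. This is a minimal illustration on synthetic data, not the notebook's own evaluation function. The dataset, the two candidate models, and the metric list are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the processed features (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# One shared splitter so every model is scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["f1", "roc_auc"]

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

summary = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    summary[name] = {
        "f1_mean": scores["test_f1"].mean(),
        "roc_auc_mean": scores["test_roc_auc"].mean(),
        "fit_time_mean": scores["fit_time"].mean(),
    }

for name, row in summary.items():
    print(name, {metric: round(value, 4) for metric, value in row.items()})
```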
5.1 Model Training and Evaluation¶
# First, make sure you're using the updated variables consistently
print(f"X_train_final shape: {X_train_final.shape}")
print(f"y_train_out shape: {y_train_out.shape}")
print(f"X_val_final shape: {X_val_final.shape}")
print(f"y_val_out shape: {y_val_out.shape}")
# Function to train, evaluate and store model metrics
def train_and_evaluate_model(model, model_name, X_train, y_train, X_val, y_val):
# Use stratified cross-validation for training metrics
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Start training timer
start_time = time.time()
# Train model
model.fit(X_train, y_train)
# Calculate training time
training_time = time.time() - start_time
# Get cross-validation scores
# F1 scoring only needs predict(), so the same call works for every model
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1', n_jobs=-1)
# Prediction timing
start_time = time.time()
y_val_pred = model.predict(X_val)
inference_time = time.time() - start_time
# Get probabilities if available
if hasattr(model, "predict_proba"):
y_val_proba = model.predict_proba(X_val)[:, 1]
else:
y_val_proba = None
# Calculate metrics
accuracy = accuracy_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
recall = recall_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)
# ROC-AUC only if probabilities are available
roc_auc = roc_auc_score(y_val, y_val_proba) if y_val_proba is not None else None
# Calculate PR-AUC and Log Loss if probabilities are available
if y_val_proba is not None:
precision_curve, recall_curve, _ = precision_recall_curve(y_val, y_val_proba)
pr_auc = auc(recall_curve, precision_curve)
log_loss_value = log_loss(y_val, y_val_proba)
else:
pr_auc = None
log_loss_value = None
# Calculate model size (with error handling)
try:
with open('temp_model.pkl', 'wb') as f:
pickle.dump(model, f)
model_size = os.path.getsize('temp_model.pkl') / (1024 * 1024) # Convert bytes to MB
try:
os.remove('temp_model.pkl')
except OSError:
pass # Ignore errors in file deletion
except Exception:
model_size = float('nan') # If pickling or file operations fail
# Store results
results = {
'Model': model_name,
'Accuracy': accuracy,
'Precision': precision,
'Recall': recall,
'F1 Score': f1,
'ROC-AUC': roc_auc,
'PR-AUC': pr_auc,
'Log Loss': log_loss_value,
'CV F1 (mean)': cv_scores.mean(),
'CV F1 (std)': cv_scores.std(),
'Training Time (s)': training_time,
'Inference Time (s)': inference_time,
'Model Size (MB)': model_size
}
return results, model
# Dictionary of models for flexible selection
models_dict = {
'Logistic Regression': LogisticRegression(random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42, n_jobs=-1),
# 'SVM': SVC(random_state=42, probability=True),
'KNN': KNeighborsClassifier(n_jobs=-1),
'Naïve Bayes': GaussianNB(),
'XGBoost': xgb.XGBClassifier(random_state=42, n_jobs=-1, eval_metric='logloss'),
'LDA': LinearDiscriminantAnalysis(),
'MLP': MLPClassifier(random_state=42)
}
# Train and evaluate each model - EXPLICITLY use the updated y variables
results_list = []
trained_models = {}
for name, model in models_dict.items():
print(f"Training {name}...")
try:
# CRITICAL: Use y_train_out and y_val_out instead of y_train and y_val
result, trained_model = train_and_evaluate_model(
model, name, X_train_final, y_train_out, X_val_final, y_val_out
)
results_list.append(result)
trained_models[name] = trained_model
print(f"Completed {name}")
except Exception as e:
print(f"Error training {name}: {str(e)}")
# Continue with next model if one fails
# Check if we have any results before creating DataFrame
if results_list:
# Create results dataframe
results_df = pd.DataFrame(results_list)
results_df.set_index('Model', inplace=True)
# Display results with cross-validation scores
print("Model Performance with Cross-Validation:")
print(results_df)
# Create a sorted version by F1 score for easy comparison
sorted_results = results_df.sort_values('F1 Score', ascending=False)
print("\nModels ranked by F1 Score:")
print(sorted_results[['F1 Score', 'CV F1 (mean)', 'CV F1 (std)', 'Precision', 'Recall']])
else:
print("No models were successfully trained. Check the error messages above.")
X_train_final shape: (155305, 40)
y_train_out shape: (155305,)
X_val_final shape: (38829, 40)
y_val_out shape: (38829,)
Training Logistic Regression...
Completed Logistic Regression
Training Decision Tree...
Completed Decision Tree
Training Random Forest...
Completed Random Forest
Training KNN...
Completed KNN
Training Naïve Bayes...
Completed Naïve Bayes
Training XGBoost...
Completed XGBoost
Training LDA...
Completed LDA
Training MLP...
Completed MLP
Model Performance with Cross-Validation:
Accuracy Precision Recall F1 Score ROC-AUC PR-AUC Log Loss CV F1 (mean) CV F1 (std) Training Time (s) Inference Time (s) Model Size (MB)
Model
Logistic Regression 0.932010 0.428246 0.498674 0.460784 0.913887 0.437560 0.165848 0.400421 0.011163 0.460003 0.004004 0.001332
Decision Tree 0.709856 0.099680 0.495579 0.165976 0.609345 0.312322 10.457848 0.373394 0.008861 15.315998 0.006999 0.863006
Random Forest 0.936156 0.362832 0.126879 0.188012 0.843693 0.249757 0.286078 0.416791 0.008346 18.021109 0.055038 75.958649
KNN 0.938886 0.468268 0.362069 0.408377 0.830542 0.409469 0.647009 0.430705 0.007753 0.012998 5.441484 48.581213
Naïve Bayes 0.894975 0.215361 0.303714 0.252018 0.726299 0.177344 0.951964 0.366624 0.001954 0.057999 0.024002 0.002119
XGBoost 0.909037 0.242289 0.263926 0.252645 0.842749 0.214116 0.215698 0.463628 0.006029 0.777999 0.021000 0.374887
LDA 0.704757 0.148725 0.861185 0.253646 0.876101 0.368238 1.107817 0.476389 0.004756 0.349511 0.005999 0.002656
MLP 0.942234 0.604396 0.024315 0.046749 0.466487 0.102347 1.561014 0.490386 0.013336 53.697011 0.022002 0.138672
Models ranked by F1 Score:
F1 Score CV F1 (mean) CV F1 (std) Precision Recall
Model
Logistic Regression 0.460784 0.400421 0.011163 0.428246 0.498674
KNN 0.408377 0.430705 0.007753 0.468268 0.362069
LDA 0.253646 0.476389 0.004756 0.148725 0.861185
XGBoost 0.252645 0.463628 0.006029 0.242289 0.263926
Naïve Bayes 0.252018 0.366624 0.001954 0.215361 0.303714
Random Forest 0.188012 0.416791 0.008346 0.362832 0.126879
Decision Tree 0.165976 0.373394 0.008861 0.099680 0.495579
MLP 0.046749 0.490386 0.013336 0.604396 0.024315
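Several models above pair a fairly high ROC-AUC with a low F1 score (LDA, for example). One possible reason is that the default 0.5 decision threshold is poorly placed for an imbalanced target. As a sketch of one option, not something this notebook does, the threshold can be chosen to maximize F1 on the validation split. The data and model below are synthetic stand-ins, and the chosen threshold should be fixed before looking at the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption), roughly 7% positives
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=20, weights=[0.93, 0.07], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]

# precision/recall have one more entry than thresholds; drop the final point
precision, recall, thresholds = precision_recall_curve(y_va, proba)
denom = precision[:-1] + recall[:-1]
f1_per_threshold = np.where(
    denom > 0, 2 * precision[:-1] * recall[:-1] / np.maximum(denom, 1e-12), 0.0
)

# Threshold with the highest validation F1
best_threshold = float(thresholds[int(np.argmax(f1_per_threshold))])

f1_default = f1_score(y_va, (proba >= 0.5).astype(int))
f1_tuned = f1_score(y_va, (proba >= best_threshold).astype(int))

print(f"F1 at 0.5: {f1_default:.4f}")
print(f"F1 at tuned threshold {best_threshold:.3f}: {f1_tuned:.4f}")
```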
5.2 Hyperparameter Tuning for Top Models¶
Let's tune the hyperparameters of our top performing models to improve their performance.
# 5.2 Hyperparameter Tuning for Top Models
# Select top 3 models based on F1 score
top_models = results_df.sort_values('F1 Score', ascending=False).head(3).index.tolist()
print(f"Top 3 models for hyperparameter tuning: {top_models}")
# Hyperparameter grids for each model
param_grids = {
'Logistic Regression': {
'C': [0.01, 0.1, 1, 10, 100],
'penalty': ['l2'],
'solver': ['liblinear', 'saga']
},
'Random Forest': {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
},
'XGBoost': {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'subsample': [0.8, 0.9, 1.0]
},
'SVM': {
'C': [0.1, 1, 10],
'gamma': ['scale', 'auto', 0.1, 0.01],
'kernel': ['rbf', 'linear']
},
'KNN': {
'n_neighbors': [3, 5, 7, 9],
'weights': ['uniform', 'distance'],
'p': [1, 2]
},
'Decision Tree': {
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy']
},
'MLP': {
'hidden_layer_sizes': [(50,), (100,), (50, 50)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001, 0.01]
},
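# Note: 'shrinkage' is only supported by the 'lsqr' and 'eigen' solvers;
# combining 'svd' with a non-None shrinkage fails to fit and is scored as NaN
'LDA': {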
'solver': ['svd', 'lsqr', 'eigen'],
'shrinkage': [None, 'auto', 0.1, 0.5, 0.9]
},
'Naïve Bayes': {
'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]
}
}
# Perform hyperparameter tuning for top models
tuned_models = {}
tuned_results = []
# Initialize the StratifiedKFold for consistent cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for model_name in top_models:
print(f"Tuning {model_name}...")
# Get base model and parameter grid
base_model = models_dict[model_name]
param_grid = param_grids[model_name]
# Create grid search with stratified cross-validation
grid_search = GridSearchCV(
estimator=base_model,
param_grid=param_grid,
scoring='f1',
cv=cv, # Use stratified k-fold
verbose=1,
n_jobs=-1
)
# Fit grid search with the correctly processed data
try:
grid_search.fit(X_train_final, y_train_out)
# Get best model
best_model = grid_search.best_estimator_
tuned_models[model_name] = best_model
# Evaluate tuned model on validation set
result, _ = train_and_evaluate_model(
best_model, f"{model_name} (Tuned)",
X_train_final, y_train_out, X_val_final, y_val_out
)
tuned_results.append(result)
print(f"Best parameters for {model_name}: {grid_search.best_params_}")
print(f"Best cross-validation F1 score: {grid_search.best_score_:.4f}")
print(f"Completed tuning {model_name}")
except Exception as e:
print(f"Error tuning {model_name}: {str(e)}")
continue
# Check if any models were successfully tuned
if tuned_results:
# Create tuned results dataframe
tuned_results_df = pd.DataFrame(tuned_results)
tuned_results_df.set_index('Model', inplace=True)
print("\nTuned Model Performance:")
print(tuned_results_df)
# Compare with original models
comparison_models = []
for model_name in top_models:
if model_name in results_df.index and f"{model_name} (Tuned)" in tuned_results_df.index:
original_f1 = results_df.loc[model_name, 'F1 Score']
tuned_f1 = tuned_results_df.loc[f"{model_name} (Tuned)", 'F1 Score']
improvement = ((tuned_f1 - original_f1) / original_f1) * 100
comparison_models.append({
'Model': model_name,
'Original F1': original_f1,
'Tuned F1': tuned_f1,
'Improvement (%)': improvement
})
if comparison_models:
comparison_df = pd.DataFrame(comparison_models)
print("\nPerformance Improvement After Tuning:")
print(comparison_df)
else:
print("No models were successfully tuned. Check the error messages above.")
Top 3 models for hyperparameter tuning: ['Logistic Regression', 'KNN', 'LDA']
Tuning Logistic Regression...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters for Logistic Regression: {'C': 100, 'penalty': 'l2', 'solver': 'saga'}
Best cross-validation F1 score: 0.4015
Completed tuning Logistic Regression
Tuning KNN...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best parameters for KNN: {'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
Best cross-validation F1 score: 0.4357
Completed tuning KNN
Tuning LDA...
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best parameters for LDA: {'shrinkage': None, 'solver': 'svd'}
Best cross-validation F1 score: 0.4764
Completed tuning LDA
Tuned Model Performance:
Accuracy Precision Recall F1 Score ROC-AUC PR-AUC Log Loss CV F1 (mean) CV F1 (std) Training Time (s) Inference Time (s) Model Size (MB)
Model
Logistic Regression (Tuned) 0.924592 0.395085 0.554377 0.461369 0.912713 0.433808 0.179540 0.401456 0.011360 5.922000 0.003002 0.001325
KNN (Tuned) 0.946406 0.623129 0.202476 0.305639 0.821283 0.401428 0.610199 0.435672 0.009825 0.012002 27.036997 48.581213
LDA (Tuned) 0.704757 0.148725 0.861185 0.253646 0.876101 0.368238 1.107817 0.476389 0.004756 0.293998 0.003008 0.002656
Performance Improvement After Tuning:
Model Original F1 Tuned F1 Improvement (%)
0 Logistic Regression 0.460784 0.461369 0.126814
1 KNN 0.408377 0.305639 -25.157636
2 LDA 0.253646 0.253646 0.000000
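One detail worth noting about the LDA grid: `shrinkage` is supported by the `lsqr` and `eigen` solvers but not by `svd`, so crossing every solver with every shrinkage value includes combinations that cannot be fitted. A possible alternative, sketched here on synthetic data (an assumption, not the census features), is to pass `GridSearchCV` a list of dicts so only valid combinations are tried.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold

# Synthetic data without redundant features, so the covariance matrix is full rank
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

# A list of dicts: each dict is expanded separately, so 'svd' never meets shrinkage
lda_grid = [
    {"solver": ["svd"]},
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto", 0.1, 0.5, 0.9]},
]
n_candidates = len(ParameterGrid(lda_grid))

search = GridSearchCV(
    LinearDiscriminantAnalysis(),
    param_grid=lda_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    error_score="raise",  # Surface any invalid combination instead of scoring NaN
)
search.fit(X_demo, y_demo)

print(f"Valid candidates: {n_candidates}")
print(f"Best parameters: {search.best_params_}")
```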
5.3 Build Ensemble Model¶
# Create ensemble model using the top 3 tuned models
ensemble_models = []
for name in top_models:
if name in tuned_models:
ensemble_models.append((name, tuned_models[name]))
# Check if we have models to ensemble
if len(ensemble_models) >= 2:
print(f"Building ensemble with {len(ensemble_models)} models: {[name for name, _ in ensemble_models]}")
# Create and train voting classifier
ensemble = VotingClassifier(
estimators=ensemble_models,
voting='soft' # Use predicted probabilities
)
# Train and evaluate ensemble using correct datasets
try:
ensemble_result, ensemble_model = train_and_evaluate_model(
ensemble, "Ensemble",
X_train_final, y_train_out, X_val_final, y_val_out
)
# Add ensemble result to tuned results
ensemble_df = pd.DataFrame([ensemble_result]).set_index('Model')
final_results = pd.concat([tuned_results_df, ensemble_df])
print("\nEnsemble Model Performance:")
print(ensemble_df)
print("\nAll Models Performance (including Ensemble):")
print(final_results.sort_values('F1 Score', ascending=False))
except Exception as e:
print(f"Error building ensemble: {str(e)}")
else:
print(f"Not enough tuned models to build an ensemble. Need at least 2, but only have {len(ensemble_models)}.")
# If we have tuned models, just display those
if tuned_results:
final_results = tuned_results_df
print("\nTuned Models Performance:")
print(final_results.sort_values('F1 Score', ascending=False))
Building ensemble with 3 models: ['Logistic Regression', 'KNN', 'LDA']
Ensemble Model Performance:
Accuracy Precision Recall F1 Score ROC-AUC PR-AUC Log Loss CV F1 (mean) CV F1 (std) Training Time (s) Inference Time (s) Model Size (MB)
Model
Ensemble 0.908573 0.348684 0.656057 0.455354 0.902643 0.46455 0.227501 0.468879 0.005031 6.343002 26.583818 97.16736
All Models Performance (including Ensemble):
Accuracy Precision Recall F1 Score ROC-AUC PR-AUC Log Loss CV F1 (mean) CV F1 (std) Training Time (s) Inference Time (s) Model Size (MB)
Model
Logistic Regression (Tuned) 0.924592 0.395085 0.554377 0.461369 0.912713 0.433808 0.179540 0.401456 0.011360 5.922000 0.003002 0.001325
Ensemble 0.908573 0.348684 0.656057 0.455354 0.902643 0.464550 0.227501 0.468879 0.005031 6.343002 26.583818 97.167360
KNN (Tuned) 0.946406 0.623129 0.202476 0.305639 0.821283 0.401428 0.610199 0.435672 0.009825 0.012002 27.036997 48.581213
LDA (Tuned) 0.704757 0.148725 0.861185 0.253646 0.876101 0.368238 1.107817 0.476389 0.004756 0.293998 0.003008 0.002656
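With `voting='soft'` and no weights, the ensemble's predicted probability is the plain average of its members' probabilities, and the predicted class is the one with the highest average. The sketch below checks this on synthetic data. The dataset and the untuned member models are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

ensemble_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("lda", LinearDiscriminantAnalysis()),
    ],
    voting="soft",
)
ensemble_demo.fit(X_demo, y_demo)

# Average the fitted members' probabilities by hand
member_probas = np.stack(
    [est.predict_proba(X_demo) for est in ensemble_demo.estimators_]
)
manual_average = member_probas.mean(axis=0)
manual_labels = ensemble_demo.classes_[manual_average.argmax(axis=1)]

ensemble_proba = ensemble_demo.predict_proba(X_demo)
print("Probabilities match:", np.allclose(manual_average, ensemble_proba))
```

Because the average is unweighted, a member with poorly calibrated probabilities can pull the ensemble's scores around; the `weights` parameter of `VotingClassifier` is one way to adjust for that.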
6. Final Model Evaluation¶
Let's evaluate our final models on the held-out test set to estimate how they perform on unseen data.
# Function to evaluate model on test set
def evaluate_on_test(model, model_name, X_test, y_test):
# Make predictions
y_test_pred = model.predict(X_test)
# Get probabilities if available
if hasattr(model, "predict_proba"):
y_test_proba = model.predict_proba(X_test)[:, 1]
else:
y_test_proba = None
# Calculate metrics
accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)
# ROC-AUC only if probabilities are available
roc_auc = roc_auc_score(y_test, y_test_proba) if y_test_proba is not None else None
# Calculate PR-AUC and Log Loss if probabilities are available
if y_test_proba is not None:
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_test_proba)
pr_auc = auc(recall_curve, precision_curve)
log_loss_value = log_loss(y_test, y_test_proba)
else:
pr_auc = None
log_loss_value = None
# Store results
results = {
'Model': model_name,
'Accuracy': accuracy,
'Precision': precision,
'Recall': recall,
'F1 Score': f1,
'ROC-AUC': roc_auc,
'PR-AUC': pr_auc,
'Log Loss': log_loss_value
}
return results, y_test_pred, y_test_proba
# Evaluate top models and ensemble on test set
test_results = []
test_predictions = {}
test_probabilities = {}
# Check which models we have available to evaluate
models_to_evaluate = {}
# Add tuned models
for model_name in tuned_models:
models_to_evaluate[f"{model_name} (Tuned)"] = tuned_models[model_name]
# Add ensemble if available
if 'ensemble_model' in locals():
models_to_evaluate["Ensemble"] = ensemble_model
print(f"Evaluating {len(models_to_evaluate)} models on the test set...")
# Evaluate each model
for model_name, model in models_to_evaluate.items():
try:
print(f"Evaluating {model_name}...")
result, y_pred, y_proba = evaluate_on_test(
model, model_name, X_test_final, y_test_out
)
test_results.append(result)
test_predictions[model_name] = y_pred
test_probabilities[model_name] = y_proba
print(f"Completed evaluation of {model_name}")
except Exception as e:
print(f"Error evaluating {model_name}: {str(e)}")
# Create test results dataframe
if test_results:
test_results_df = pd.DataFrame(test_results)
test_results_df.set_index('Model', inplace=True)
print("\nTest Set Performance:")
print(test_results_df)
# Sort by F1 score for easy comparison
sorted_test_results = test_results_df.sort_values('F1 Score', ascending=False)
print("\nModels ranked by Test F1 Score:")
print(sorted_test_results[['F1 Score', 'Precision', 'Recall', 'Accuracy']])
# Identify best model
best_model_name = sorted_test_results.index[0]
print(f"\nBest performing model on test set: {best_model_name}")
for metric in ['F1 Score', 'Precision', 'Recall', 'Accuracy', 'ROC-AUC']:
if metric in sorted_test_results.columns:
print(f"{metric}: {sorted_test_results.loc[best_model_name, metric]:.4f}")
else:
print("No models were successfully evaluated on the test set.")
Evaluating 4 models on the test set...
Evaluating Logistic Regression (Tuned)...
Completed evaluation of Logistic Regression (Tuned)
Evaluating KNN (Tuned)...
Completed evaluation of KNN (Tuned)
Evaluating LDA (Tuned)...
Completed evaluation of LDA (Tuned)
Evaluating Ensemble...
Completed evaluation of Ensemble
Test Set Performance:
Accuracy Precision Recall F1 Score ROC-AUC PR-AUC Log Loss
Model
Logistic Regression (Tuned) 0.943665 0.553529 0.288607 0.379398 0.908615 0.434577 0.154759
KNN (Tuned) 0.943959 0.592197 0.195105 0.293510 0.808598 0.393574 0.675675
LDA (Tuned) 0.729239 0.159965 0.832189 0.268347 0.872784 0.378184 0.947044
Ensemble 0.935543 0.461732 0.484416 0.472802 0.895087 0.459025 0.204953
Models ranked by Test F1 Score:
F1 Score Precision Recall Accuracy
Model
Ensemble 0.472802 0.461732 0.484416 0.935543
Logistic Regression (Tuned) 0.379398 0.553529 0.288607 0.943665
KNN (Tuned) 0.293510 0.592197 0.195105 0.943959
LDA (Tuned) 0.268347 0.159965 0.832189 0.729239
Best performing model on test set: Ensemble
F1 Score: 0.4728
Precision: 0.4617
Recall: 0.4844
Accuracy: 0.9355
ROC-AUC: 0.8951
Let's visualize the performance of our top models on the test set.
# Visualize test results - ROC curves
if test_results and len(test_probabilities) > 0:
plt.figure(figsize=(10, 8))
for model_name, y_pred_proba in test_probabilities.items():
if y_pred_proba is not None:
fpr, tpr, _ = roc_curve(y_test_out, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Chance')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves for Models on Test Set', fontsize=15)
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Precision-Recall curves
plt.figure(figsize=(10, 8))
for model_name, y_pred_proba in test_probabilities.items():
if y_pred_proba is not None:
precision, recall, _ = precision_recall_curve(y_test_out, y_pred_proba)
pr_auc = auc(recall, precision)
plt.plot(recall, precision, label=f'{model_name} (AUC = {pr_auc:.3f})')
# Add baseline
no_skill = sum(y_test_out) / len(y_test_out)
plt.axhline(y=no_skill, linestyle='--', color='gray', label='Baseline')
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curves for Models on Test Set', fontsize=15)
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
7. Feature Importance Analysis¶
Let's analyze which features are most important for our best model.
# Check if we have test_results_df
if 'test_results_df' not in locals() or test_results_df.empty:
print("No test results available for feature importance analysis.")
else:
# Determine best model based on test F1 score
best_model_name = test_results_df['F1 Score'].idxmax()
print(f"Best model based on test F1 score: {best_model_name}")
# Get the best model
best_model = None
# Handle model name cases properly
if best_model_name in models_to_evaluate:
best_model = models_to_evaluate[best_model_name]
print(f"Successfully retrieved model: {best_model_name}")
else:
print(f"Could not find model: {best_model_name}")
# Only proceed if we have a model
if best_model is not None:
# Check if model is tree-based or has feature_importances_
if hasattr(best_model, 'feature_importances_'):
# Direct access for tree-based models
importances = best_model.feature_importances_
feature_names = X_train_final.columns
# Create feature importance dataframe
feature_importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importances
})
# Sort by importance
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
# Plot top 20 most important features
plt.figure(figsize=(12, 10))
top_features = feature_importance_df.head(20)
ax = sns.barplot(x='Importance', y='Feature', data=top_features)
plt.title(f'Top 20 Feature Importances for {best_model_name}', fontsize=15)
plt.tight_layout()
plt.show()
print("Top 10 most important features:")
print(feature_importance_df.head(10))
# Handle VotingClassifier ensemble models
elif hasattr(best_model, 'estimators_'): # Fitted VotingClassifier exposes estimators_
print("Best model is a VotingClassifier ensemble. Analyzing feature importance of its components.")
# estimators_ is the list of fitted models; estimators (no underscore) holds (name, model) tuples
for i, estimator in enumerate(best_model.estimators_):
if hasattr(estimator, 'feature_importances_'):
# Get the name based on the model type
name = type(estimator).__name__
importances = estimator.feature_importances_
feature_names = X_train_final.columns
# Create feature importance dataframe
feature_importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importances
})
# Sort by importance
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
# Plot top 20 most important features
plt.figure(figsize=(12, 10))
top_features = feature_importance_df.head(20)
ax = sns.barplot(x='Importance', y='Feature', data=top_features)
plt.title(f'Top 20 Feature Importances for Ensemble Component {i+1}: {name}', fontsize=15)
plt.tight_layout()
plt.show()
print(f"Top 10 most important features for ensemble component {i+1} ({name}):")
print(feature_importance_df.head(10))
break # Just use the first tree-based model
# For models without direct feature_importances_
else:
print(f"Model {best_model_name} does not directly expose feature importances.")
try:
# Try to get coefficients for linear models
if hasattr(best_model, 'coef_'):
coef = best_model.coef_[0] if best_model.coef_.ndim > 1 else best_model.coef_
feature_names = X_train_final.columns
# Create feature importance dataframe
feature_importance_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coef
})
# Sort by absolute coefficient value
feature_importance_df['Abs_Coefficient'] = feature_importance_df['Coefficient'].abs()
feature_importance_df = feature_importance_df.sort_values('Abs_Coefficient', ascending=False)
# Plot top 20 most important features
plt.figure(figsize=(12, 10))
top_features = feature_importance_df.head(20)
ax = sns.barplot(x='Coefficient', y='Feature', data=top_features)
plt.title(f'Top 20 Feature Coefficients for {best_model_name}', fontsize=15)
plt.tight_layout()
plt.show()
print("Top 10 most important features by coefficient magnitude:")
print(feature_importance_df[['Feature', 'Coefficient']].head(10))
else:
print("Consider using permutation importance or SHAP values for this model type.")
except Exception as e:
print(f"Error extracting feature importance: {str(e)}")
print("Consider using permutation importance or SHAP values for this model type.")
Best model based on test F1 score: Ensemble
Successfully retrieved model: Ensemble
Best model is a VotingClassifier ensemble. Analyzing feature importance of its components.
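None of the ensemble's members expose `feature_importances_`, so the cell above prints nothing further. Permutation importance is a model-agnostic alternative that works on the ensemble as a whole. Here is a minimal sketch on synthetic data. The two-member ensemble, the `PC`-style feature names, and the choice of F1 as the scoring metric are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=42,
)
feature_names = [f"PC{i + 1}" for i in range(X_demo.shape[1])]
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

ensemble_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("lda", LinearDiscriminantAnalysis()),
    ],
    voting="soft",
).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in F1
result = permutation_importance(
    ensemble_demo, X_va, y_va, scoring="f1", n_repeats=5, random_state=42
)

ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```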
# Extract feature importance from the Logistic Regression tuned model
if 'Logistic Regression (Tuned)' in models_to_evaluate:
# Get the model
lr_model = models_to_evaluate['Logistic Regression (Tuned)']
# Extract coefficients
if hasattr(lr_model, 'coef_'):
coef = lr_model.coef_[0] if lr_model.coef_.ndim > 1 else lr_model.coef_
feature_names = X_train_final.columns
# Create feature importance dataframe
lr_importance_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coef
})
# Sort by absolute coefficient value
lr_importance_df['Abs_Coefficient'] = lr_importance_df['Coefficient'].abs()
lr_importance_df = lr_importance_df.sort_values('Abs_Coefficient', ascending=False)
# Plot top 20 most important features
plt.figure(figsize=(12, 10))
top_features = lr_importance_df.head(20)
sns.barplot(x='Coefficient', y='Feature', data=top_features)
plt.title('Top 20 Feature Coefficients for Logistic Regression (Tuned)', fontsize=15)
plt.tight_layout()
plt.show()
print("Top 10 most important features by coefficient magnitude:")
print(lr_importance_df[['Feature', 'Coefficient']].head(10))
else:
print("Logistic Regression model doesn't have coefficients attribute.")
else:
print("Logistic Regression (Tuned) model not found in the evaluated models.")
Top 10 most important features by coefficient magnitude:
   Feature  Coefficient
0      PC1     1.591328
15    PC16     1.385211
14    PC15    -1.202561
19    PC20     1.013046
21    PC22    -0.975054
18    PC19     0.792207
35    PC36    -0.764262
34    PC35    -0.746540
12    PC13     0.613604
4      PC5    -0.569819
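Coefficients on principal components are hard to interpret, because each component mixes many original features. Since PCA is a linear map, the coefficients can be projected back onto the scaled original features by multiplying with the component loadings. The sketch below shows this on synthetic data. It assumes PCA without whitening (the scikit-learn default), and the feature names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_raw, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=4, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_raw.shape[1])]

# Same order of steps as the notebook: scale, project, then fit a linear model
X_scaled = StandardScaler().fit_transform(X_raw)
pca_demo = PCA(n_components=5).fit(X_scaled)
X_pcs = pca_demo.transform(X_scaled)
lr_demo = LogisticRegression(max_iter=1000).fit(X_pcs, y_demo)

# components_ has shape (n_components, n_features); transpose to map back
pc_coef = lr_demo.coef_[0]
original_coef = pca_demo.components_.T @ pc_coef

# Sanity check: both parameterizations give the same decision scores
scores_from_pcs = lr_demo.decision_function(X_pcs)
scores_from_original = (
    (X_scaled - pca_demo.mean_) @ original_coef + lr_demo.intercept_[0]
)

top = np.argsort(np.abs(original_coef))[::-1][:5]
for idx in top:
    print(f"{feature_names[idx]}: {original_coef[idx]:+.4f}")
```

The projected coefficients apply to the scaled features, so their magnitudes are comparable across features only to the extent that the scaling put them on a common footing.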